2010-01-13 01:38:00

by Mathieu Desnoyers

Subject: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process.

It aims at greatly simplifying and enhancing the current signal-based
liburcu userspace RCU synchronize_rcu() implementation.
(found at http://lttng.org/urcu)
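For illustration, invoking the new system call from user-space could look like
the minimal sketch below. This is illustrative only: it assumes the x86_64
system call number 299 assigned by this patch and defines its own wrapper,
since no glibc wrapper exists.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier 299	/* x86_64 number assigned in this patch */
#endif

/* Illustrative wrapper around the new system call. */
static int membarrier(int expedited)
{
	return syscall(__NR_membarrier, expedited);
}

int main(void)
{
	if (membarrier(1))	/* expedited: a few microseconds */
		perror("membarrier (expedited)");
	if (membarrier(0))	/* non-expedited: a few milliseconds */
		perror("membarrier (non-expedited)");
	return 0;
}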

Changelog since v1:

- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPIs with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Changelog since v2:
- Simply send-to-many to the mm_cpumask. It contains the list of processors we
have to IPI (those which use the mm), and this mask is updated atomically.

Changelog since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before sending an
IPI. This ensures that we do not disturb RT tasks in the presence of lazy TLB
shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changelog since v4:
- Add "int expedited" parameter, use synchronize_sched() in this case.
- Check for num_online_cpus() == 1 and quickly return without doing anything.

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads of the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary
wake-ups and only issue the memory barriers on active threads.
Non-running threads do not need to execute such a barrier anyway,
because it is already implied by the scheduler's context switches.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A synchronize_rcu() are
ordering memory accesses with respect to smp_mb() present in
rcu_read_lock/unlock(), we can change all smp_mb() from
synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
rcu_read_lock/unlock() into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B's accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() semantics thanks to the IPIs executing memory barriers on each
active thread of the process. Every non-running process thread is
intrinsically serialized by the scheduler.
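
To make the pairing above concrete, the read-side and write-side could be
structured as in the following sketch. This is not the actual liburcu code:
gp_ctr, reader_gp and the helper names are made up for illustration; only the
placement of barrier() versus sys_membarrier() matters.

#define barrier()	__asm__ __volatile__("" : : : "memory")

extern int membarrier(int expedited);		/* wrapper around sys_membarrier() */

static unsigned long gp_ctr;			/* hypothetical grace-period counter */
static __thread unsigned long reader_gp;	/* hypothetical per-reader snapshot */

static inline void example_rcu_read_lock(void)
{
	reader_gp = gp_ctr;
	barrier();	/* was smp_mb(); now only a compiler barrier */
}

static inline void example_rcu_read_unlock(void)
{
	barrier();	/* was smp_mb(); now only a compiler barrier */
	reader_gp = 0;
}

static void example_synchronize_rcu(void)
{
	membarrier(1);	/* was smp_mb(); upgrades the readers' barrier() */
	/* ... wait for pre-existing readers to complete ... */
	membarrier(1);	/* was smp_mb() */
}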

On my Intel Xeon E5405
(one thread is doing the sys_membarrier() calls, the others are busy looping):

* expedited

10,000,000 sys_membarrier calls:

T=1: 0m20.173s
T=2: 0m20.506s
T=3: 0m22.632s
T=4: 0m24.759s
T=5: 0m26.633s
T=6: 0m29.654s
T=7: 0m30.669s

That is, roughly 2-3 microseconds per call.

* non-expedited

1000 sys_membarrier calls:

T=1-7: 0m16.002s

That is, roughly 16 milliseconds per call (~5000-8000 times slower than expedited).

The expected top pattern for the expedited scheme, when using 1 CPU for a thread
doing sys_membarrier() in a loop and 7 other threads busy-waiting in user-space
on a variable, shows that the thread doing sys_membarrier() spends most of its
time in system calls, while the other threads run mostly in user-space. Note
that IPI handlers are not taken into account in the CPU time sampling.

Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 99.7%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.0%si, 0.0%st
Cpu2 : 99.3%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.7%hi, 0.0%si, 0.0%st
Cpu3 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 96.0%us, 1.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 2.6%si, 0.0%st
Cpu6 : 1.3%us, 98.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 96.1%us, 3.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes

(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes

(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 8698725501 reads, 313 writes

So the dynamic sys_membarrier availability check adds some overhead to the
read-side, but besides that, with the expedited scheme, we can see that we are
close to the read-side performance of the signal-based scheme and also reach
about 5/8 of the memory-barrier scheme's write-side performance. We get a
write-side speedup of 400:1 over the signal-based scheme by using the
sys_membarrier system call, while keeping a 4.5:1 read-side speedup over the
memory barrier scheme.

Indeed, the non-expedited scheme adds much less overhead on the read-side, both
because we do not send IPIs and because we perform fewer updates, which in turn
generates fewer cache-line exchanges. The write-side latency, however, becomes
even higher than with the signal-based scheme. The advantage of the
non-expedited sys_membarrier() scheme over the signal-based scheme is that it
does not require waking up all the process threads.

The system call number is only assigned for x86_64 in this RFC patch. Note that
a switch_mm() memory barrier audit is required for each architecture before
assigning it a system call number.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
arch/x86/include/asm/mmu_context.h | 18 +++++-
arch/x86/include/asm/unistd_64.h | 2
kernel/sched.c | 111 +++++++++++++++++++++++++++++++++++++
3 files changed, 129 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h 2010-01-12 10:25:47.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h 2010-01-12 10:25:57.000000000 -0500
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
__SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 298
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_membarrier 299
+__SYSCALL(__NR_membarrier, sys_membarrier)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c 2010-01-12 10:25:47.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c 2010-01-12 14:33:20.000000000 -0500
@@ -10822,6 +10822,117 @@ struct cgroup_subsys cpuacct_subsys = {
};
#endif /* CONFIG_CGROUP_CPUACCT */

+#ifdef CONFIG_SMP
+
+/*
+ * Execute a memory barrier on all active threads from the current process
+ * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
+ * because batched IPI lists are synchronized with spinlocks rather than full
+ * memory barriers. This is not the bulk of the overhead anyway, so let's stay
+ * on the safe side.
+ */
+static void membarrier_ipi(void *unused)
+{
+ smp_mb();
+}
+
+/*
+ * Handle out-of-mem by sending per-cpu IPIs instead.
+ */
+static void membarrier_retry(void)
+{
+ struct mm_struct *mm;
+ int cpu;
+
+ for_each_cpu(cpu, mm_cpumask(current->mm)) {
+ spin_lock_irq(&cpu_rq(cpu)->lock);
+ mm = cpu_curr(cpu)->mm;
+ spin_unlock_irq(&cpu_rq(cpu)->lock);
+ if (current->mm == mm)
+ smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+ }
+}
+
+#endif /* #ifdef CONFIG_SMP */
+
+/*
+ * sys_membarrier - issue memory barrier on current process running threads
+ * @expedited: (0) Lowest overhead. Few milliseconds latency.
+ * (1) Few microseconds latency.
+ *
+ * Execute a memory barrier on all running threads of the current process.
+ * Upon completion, the caller thread is ensured that all process threads
+ * have passed through a state where memory accesses match program order.
+ * (non-running threads are de facto in such a state)
+ *
+ * mm_cpumask is used as an approximation. It is a superset of the cpumask to
+ * which we must send IPIs, mainly due to lazy TLB shootdown. Therefore,
+ * we check each runqueue to make sure our ->mm is indeed running on them. This
+ * reduces the risk of disturbing a RT task by sending unnecessary IPIs. There
+ * is still a slight chance to disturb an unrelated task, because we do not lock
+ * the runqueues while sending IPIs, but the real-time effect of this heavy
+ * locking would be worse than the comparatively small disruption of an IPI.
+ *
+ * RED PEN: before assigning a system call number for sys_membarrier() to an
+ * architecture, we must ensure that switch_mm issues full memory barriers (or a
+ * synchronizing instruction having the same effect) between:
+ * - user-space code execution and clear mm_cpumask.
+ * - set mm_cpumask and user-space code execution.
+ * In some cases adding a comment to this effect will suffice, in others we will
+ * need to add smp_mb__before_clear_bit()/smp_mb__after_clear_bit() or simply
+ * smp_mb(). These barriers are required to ensure we do not _miss_ a CPU that
+ * needs to receive an IPI, which would be a bug.
+ *
+ * On uniprocessor systems, this system call simply returns 0 without doing
+ * anything, so user-space knows it is implemented.
+ */
+SYSCALL_DEFINE1(membarrier, int, expedited)
+{
+#ifdef CONFIG_SMP
+ cpumask_var_t tmpmask;
+ struct mm_struct *mm;
+ int cpu;
+
+ if (unlikely(thread_group_empty(current) || (num_online_cpus() == 1)))
+ return 0;
+ if (!unlikely(expedited)) {
+ synchronize_sched();
+ return 0;
+ }
+ /*
+ * Memory barrier on the caller thread _before_ sending first
+ * IPI. Matches memory barriers around mm_cpumask modification in
+ * switch_mm().
+ */
+ smp_mb();
+ if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+ membarrier_retry();
+ goto unlock;
+ }
+ cpumask_copy(tmpmask, mm_cpumask(current->mm));
+ preempt_disable();
+ cpumask_clear_cpu(smp_processor_id(), tmpmask);
+ for_each_cpu(cpu, tmpmask) {
+ spin_lock_irq(&cpu_rq(cpu)->lock);
+ mm = cpu_curr(cpu)->mm;
+ spin_unlock_irq(&cpu_rq(cpu)->lock);
+ if (current->mm != mm)
+ cpumask_clear_cpu(cpu, tmpmask);
+ }
+ smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+ preempt_enable();
+ free_cpumask_var(tmpmask);
+unlock:
+ /*
+ * Memory barrier on the caller thread _after_ we finished
+ * waiting for the last IPI. Matches memory barriers around mm_cpumask
+ * modification in switch_mm().
+ */
+ smp_mb();
+#endif /* #ifdef CONFIG_SMP */
+ return 0;
+}
+
#ifndef CONFIG_SMP

int rcu_expedited_torture_stats(char *page)
Index: linux-2.6-lttng/arch/x86/include/asm/mmu_context.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/mmu_context.h 2010-01-12 10:59:31.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/mmu_context.h 2010-01-12 11:59:49.000000000 -0500
@@ -36,6 +36,11 @@ static inline void switch_mm(struct mm_s
unsigned cpu = smp_processor_id();

if (likely(prev != next)) {
+ /*
+ * smp_mb() between user-space thread execution and
+ * mm_cpumask clear is required by sys_membarrier().
+ */
+ smp_mb__before_clear_bit();
/* stop flush ipis for the previous mm */
cpumask_clear_cpu(cpu, mm_cpumask(prev));
#ifdef CONFIG_SMP
@@ -43,7 +48,11 @@ static inline void switch_mm(struct mm_s
percpu_write(cpu_tlbstate.active_mm, next);
#endif
cpumask_set_cpu(cpu, mm_cpumask(next));
-
+ /*
+ * smp_mb() between mm_cpumask set and user-space thread
+ * execution is required by sys_membarrier(). Implied by
+ * load_cr3.
+ */
/* Re-load page tables */
load_cr3(next->pgd);

@@ -59,9 +68,14 @@ static inline void switch_mm(struct mm_s
BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);

if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
- /* We were in lazy tlb mode and leave_mm disabled
+ /*
+ * We were in lazy tlb mode and leave_mm disabled
* tlb flush IPI delivery. We must reload CR3
* to make sure to use no freed page tables.
+ *
+ * smp_mb() between mm_cpumask set and user-space
+ * thread execution is required by sys_membarrier().
+ * Implied by load_cr3.
*/
load_cr3(next->pgd);
load_LDT_nolock(&next->context);
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2010-01-13 03:23:51

by KOSAKI Motohiro

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

Hi

Interesting patch :)

I have a few comments.

> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c 2010-01-12 10:25:47.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c 2010-01-12 14:33:20.000000000 -0500
> @@ -10822,6 +10822,117 @@ struct cgroup_subsys cpuacct_subsys = {
> };
> #endif /* CONFIG_CGROUP_CPUACCT */
>
> +#ifdef CONFIG_SMP
> +
> +/*
> + * Execute a memory barrier on all active threads from the current process
> + * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
> + * because batched IPI lists are synchronized with spinlocks rather than full
> + * memory barriers. This is not the bulk of the overhead anyway, so let's stay
> + * on the safe side.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> + smp_mb();
> +}
> +
> +/*
> + * Handle out-of-mem by sending per-cpu IPIs instead.
> + */
> +static void membarrier_retry(void)
> +{
> + struct mm_struct *mm;
> + int cpu;
> +
> + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> + spin_lock_irq(&cpu_rq(cpu)->lock);
> + mm = cpu_curr(cpu)->mm;
> + spin_unlock_irq(&cpu_rq(cpu)->lock);
> + if (current->mm == mm)
> + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> + }
> +}
> +
> +#endif /* #ifdef CONFIG_SMP */
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + * @expedited: (0) Lowest overhead. Few milliseconds latency.
> + * (1) Few microseconds latency.

Why do we need both expedited and non-expedited modes? At least, this documentation
is bad: it suggests "you have to use non-expedited mode always!".


> + *
> + * Execute a memory barrier on all running threads of the current process.
> + * Upon completion, the caller thread is ensured that all process threads
> + * have passed through a state where memory accesses match program order.
> + * (non-running threads are de facto in such a state)
> + *
> + * mm_cpumask is used as an approximation. It is a superset of the cpumask to
> + * which we must send IPIs, mainly due to lazy TLB shootdown. Therefore,
> + * we check each runqueue to make sure our ->mm is indeed running on them. This
> + * reduces the risk of disturbing a RT task by sending unnecessary IPIs. There
> + * is still a slight chance to disturb an unrelated task, because we do not lock
> + * the runqueues while sending IPIs, but the real-time effect of this heavy
> + * locking would be worse than the comparatively small disruption of an IPI.
> + *
> + * RED PEN: before assinging a system call number for sys_membarrier() to an
> + * architecture, we must ensure that switch_mm issues full memory barriers (or a
> + * synchronizing instruction having the same effect) between:
> + * - user-space code execution and clear mm_cpumask.
> + * - set mm_cpumask and user-space code execution.
> + * In some case adding a comment to this effect will suffice, in others we will
> + * need to add smp_mb__before_clear_bit()/smp_mb__after_clear_bit() or simply
> + * smp_mb(). These barriers are required to ensure we do not _miss_ a CPU that
> + * need to receive an IPI, which would be a bug.
> + *
> + * On uniprocessor systems, this system call simply returns 0 without doing
> + * anything, so user-space knows it is implemented.
> + */
> +SYSCALL_DEFINE1(membarrier, int, expedited)
> +{
> +#ifdef CONFIG_SMP
> + cpumask_var_t tmpmask;
> + struct mm_struct *mm;
> + int cpu;
> +
> + if (unlikely(thread_group_empty(current) || (num_online_cpus() == 1)))
> + return 0;
> + if (!unlikely(expedited)) {

unlikely(!expedited)?


> + synchronize_sched();
> + return 0;
> + }
> + /*
> + * Memory barrier on the caller thread _before_ sending first
> + * IPI. Matches memory barriers around mm_cpumask modification in
> + * switch_mm().
> + */
> + smp_mb();
> + if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> + membarrier_retry();
> + goto unlock;
> + }

If CONFIG_CPUMASK_OFFSTACK=1, alloc_cpumask_var() calls kmalloc(). FWIW,
calling kmalloc() seems to destroy the worth of this patch.

#ifdef CONFIG_CPUMASK_OFFSTACK
membarrier_retry();
goto unlock;
#endif

Is that better? I'm not sure.


> + cpumask_copy(tmpmask, mm_cpumask(current->mm));
> + preempt_disable();
> + cpumask_clear_cpu(smp_processor_id(), tmpmask);
> + for_each_cpu(cpu, tmpmask) {
> + spin_lock_irq(&cpu_rq(cpu)->lock);
> + mm = cpu_curr(cpu)->mm;
> + spin_unlock_irq(&cpu_rq(cpu)->lock);
> + if (current->mm != mm)
> + cpumask_clear_cpu(cpu, tmpmask);
> + }
> + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> + preempt_enable();
> + free_cpumask_var(tmpmask);
> +unlock:
> + /*
> + * Memory barrier on the caller thread _after_ we finished
> + * waiting for the last IPI. Matches memory barriers around mm_cpumask
> + * modification in switch_mm().
> + */
> + smp_mb();
> +#endif /* #ifdef CONFIG_SMP */
> + return 0;
> +}

2010-01-13 03:58:13

by Mathieu Desnoyers

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* KOSAKI Motohiro ([email protected]) wrote:
> Hi
>
> Interesting patch :)
>
> I have few comments.
>
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c 2010-01-12 10:25:47.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c 2010-01-12 14:33:20.000000000 -0500
> > @@ -10822,6 +10822,117 @@ struct cgroup_subsys cpuacct_subsys = {
> > };
> > #endif /* CONFIG_CGROUP_CPUACCT */
> >
> > +#ifdef CONFIG_SMP
> > +
> > +/*
> > + * Execute a memory barrier on all active threads from the current process
> > + * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
> > + * because batched IPI lists are synchronized with spinlocks rather than full
> > + * memory barriers. This is not the bulk of the overhead anyway, so let's stay
> > + * on the safe side.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > + smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-mem by sending per-cpu IPIs instead.
> > + */
> > +static void membarrier_retry(void)
> > +{
> > + struct mm_struct *mm;
> > + int cpu;
> > +
> > + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > + mm = cpu_curr(cpu)->mm;
> > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > + if (current->mm == mm)
> > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > + }
> > +}
> > +
> > +#endif /* #ifdef CONFIG_SMP */
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + * @expedited: (0) Lowest overhead. Few milliseconds latency.
> > + * (1) Few microseconds latency.
>
> Why do we need both expedited and non-expedited mode? at least, this documentation
> is bad. it suggest "you have to use non-expedited mode always!".

Right. Maybe I should rather write:

+ * @expedited: (0) Low overhead, but slow execution (few milliseconds)
+ * (1) Slightly higher overhead, fast execution (few microseconds)

And I could probably go as far as adding a few paragraphs:

Using the non-expedited mode is recommended for applications which can
afford leaving the caller thread waiting for a few milliseconds. A good
example would be a thread dedicated to executing RCU callbacks, which
waits for callbacks to enqueue most of the time anyway.

The expedited mode is recommended whenever the application needs to have
control returning to the caller thread as quickly as possible. An
example of such an application would be one which uses the same thread to
perform data structure updates and issue the RCU synchronization.

It is perfectly safe to call both expedited and non-expedited
sys_membarriers in a process.
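
(Purely as an illustration of these two recommendations, with made-up helper
names, assuming a membarrier() wrapper around the system call:)

/* Dedicated callback thread: waiting a few milliseconds is acceptable. */
static void *example_call_rcu_thread(void *arg)
{
	for (;;) {
		wait_for_enqueued_callbacks();	/* hypothetical */
		membarrier(0);			/* non-expedited */
		run_callbacks();		/* hypothetical */
	}
	return NULL;
}

/* Updater thread that performs both the update and the synchronization. */
static void example_update(void *newval)
{
	publish(newval);	/* hypothetical data-structure update */
	membarrier(1);		/* expedited: control returns quickly */
	reclaim_old();		/* hypothetical */
}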


Does that help ?

>
>
> > + *
> > + * Execute a memory barrier on all running threads of the current process.
> > + * Upon completion, the caller thread is ensured that all process threads
> > + * have passed through a state where memory accesses match program order.
> > + * (non-running threads are de facto in such a state)
> > + *
> > + * mm_cpumask is used as an approximation. It is a superset of the cpumask to
> > + * which we must send IPIs, mainly due to lazy TLB shootdown. Therefore,
> > + * we check each runqueue to make sure our ->mm is indeed running on them. This
> > + * reduces the risk of disturbing a RT task by sending unnecessary IPIs. There
> > + * is still a slight chance to disturb an unrelated task, because we do not lock
> > + * the runqueues while sending IPIs, but the real-time effect of this heavy
> > + * locking would be worse than the comparatively small disruption of an IPI.
> > + *
> > + * RED PEN: before assinging a system call number for sys_membarrier() to an
> > + * architecture, we must ensure that switch_mm issues full memory barriers (or a
> > + * synchronizing instruction having the same effect) between:
> > + * - user-space code execution and clear mm_cpumask.
> > + * - set mm_cpumask and user-space code execution.
> > + * In some case adding a comment to this effect will suffice, in others we will
> > + * need to add smp_mb__before_clear_bit()/smp_mb__after_clear_bit() or simply
> > + * smp_mb(). These barriers are required to ensure we do not _miss_ a CPU that
> > + * need to receive an IPI, which would be a bug.
> > + *
> > + * On uniprocessor systems, this system call simply returns 0 without doing
> > + * anything, so user-space knows it is implemented.
> > + */
> > +SYSCALL_DEFINE1(membarrier, int, expedited)
> > +{
> > +#ifdef CONFIG_SMP
> > + cpumask_var_t tmpmask;
> > + struct mm_struct *mm;
> > + int cpu;
> > +
> > + if (unlikely(thread_group_empty(current) || (num_online_cpus() == 1)))
> > + return 0;
> > + if (!unlikely(expedited)) {
>
> unlikely(!expedited)?

Woops, thanks !

>
>
> > + synchronize_sched();
> > + return 0;
> > + }
> > + /*
> > + * Memory barrier on the caller thread _before_ sending first
> > + * IPI. Matches memory barriers around mm_cpumask modification in
> > + * switch_mm().
> > + */
> > + smp_mb();
> > + if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > + membarrier_retry();
> > + goto unlock;
> > + }
>
> if CONFIG_CPUMASK_OFFSTACK=1, alloc_cpumask_var call kmalloc. FWIW,
> kmalloc calling seems destory the worth of this patch.

Why ? I'm not sure I understand your point. Even if we call kmalloc to
allocate the cpumask, this is a constant overhead. The benefit of
smp_call_function_many() over smp_call_function_single() is that it
scales better by allowing to broadcast IPIs when the architecture
supports it. Or maybe I'm missing something ?

>
> #ifdef CONFIG_CPUMASK_OFFSTACK
> membarrier_retry();
> goto unlock;
> #endif
>
> is better? I'm not sure.

Thanks for the comments !

Mathieu


>
>
> > + cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > + preempt_disable();
> > + cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > + for_each_cpu(cpu, tmpmask) {
> > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > + mm = cpu_curr(cpu)->mm;
> > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > + if (current->mm != mm)
> > + cpumask_clear_cpu(cpu, tmpmask);
> > + }
> > + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > + preempt_enable();
> > + free_cpumask_var(tmpmask);
> > +unlock:
> > + /*
> > + * Memory barrier on the caller thread _after_ we finished
> > + * waiting for the last IPI. Matches memory barriers around mm_cpumask
> > + * modification in switch_mm().
> > + */
> > + smp_mb();
> > +#endif /* #ifdef CONFIG_SMP */
> > + return 0;
> > +}
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 04:47:55

by KOSAKI Motohiro

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

> * KOSAKI Motohiro ([email protected]) wrote:
> > Hi
> >
> > Interesting patch :)
> >
> > I have few comments.
> >
> > > Index: linux-2.6-lttng/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6-lttng.orig/kernel/sched.c 2010-01-12 10:25:47.000000000 -0500
> > > +++ linux-2.6-lttng/kernel/sched.c 2010-01-12 14:33:20.000000000 -0500
> > > @@ -10822,6 +10822,117 @@ struct cgroup_subsys cpuacct_subsys = {
> > > };
> > > #endif /* CONFIG_CGROUP_CPUACCT */
> > >
> > > +#ifdef CONFIG_SMP
> > > +
> > > +/*
> > > + * Execute a memory barrier on all active threads from the current process
> > > + * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
> > > + * because batched IPI lists are synchronized with spinlocks rather than full
> > > + * memory barriers. This is not the bulk of the overhead anyway, so let's stay
> > > + * on the safe side.
> > > + */
> > > +static void membarrier_ipi(void *unused)
> > > +{
> > > + smp_mb();
> > > +}
> > > +
> > > +/*
> > > + * Handle out-of-mem by sending per-cpu IPIs instead.
> > > + */
> > > +static void membarrier_retry(void)
> > > +{
> > > + struct mm_struct *mm;
> > > + int cpu;
> > > +
> > > + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > > + mm = cpu_curr(cpu)->mm;
> > > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > + if (current->mm == mm)
> > > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > > + }
> > > +}
> > > +
> > > +#endif /* #ifdef CONFIG_SMP */
> > > +
> > > +/*
> > > + * sys_membarrier - issue memory barrier on current process running threads
> > > + * @expedited: (0) Lowest overhead. Few milliseconds latency.
> > > + * (1) Few microseconds latency.
> >
> > Why do we need both expedited and non-expedited mode? at least, this documentation
> > is bad. it suggest "you have to use non-expedited mode always!".
>
> Right. Maybe I should rather write:
>
> + * @expedited: (0) Low overhead, but slow execution (few milliseconds)
> + * (1) Slightly higher overhead, fast execution (few microseconds)
>
> And I could probably go as far as adding a few paragraphs:
>
> Using the non-expedited mode is recommended for applications which can
> afford leaving the caller thread waiting for a few milliseconds. A good
> example would be a thread dedicated to execute RCU callbacks, which
> waits for callbacks to enqueue most of the time anyway.
>
> The expedited mode is recommended whenever the application needs to have
> control returning to the caller thread as quickly as possible. An
> example of such application would be one which uses the same thread to
> perform data structure updates and issue the RCU synchronization.
>
> It is perfectly safe to call both expedited and non-expedited
> sys_membarriers in a process.
>
>
> Does that help ?

Does liburcu need both? I bet the average programmer won't understand this
explanation. Please recall, the syscall interface is used by non-kernel
developers too. If liburcu only uses either (0) or (1), I hope we can remove
the other one.

But if liburcu really needs both, the above explanation is good enough,
I think.


> > > + * Memory barrier on the caller thread _before_ sending first
> > > + * IPI. Matches memory barriers around mm_cpumask modification in
> > > + * switch_mm().
> > > + */
> > > + smp_mb();
> > > + if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > > + membarrier_retry();
> > > + goto unlock;
> > > + }
> >
> > if CONFIG_CPUMASK_OFFSTACK=1, alloc_cpumask_var call kmalloc. FWIW,
> > kmalloc calling seems destory the worth of this patch.
>
> Why ? I'm not sure I understand your point. Even if we call kmalloc to
> allocate the cpumask, this is a constant overhead. The benefit of
> smp_call_function_many() over smp_call_function_single() is that it
> scales better by allowing to broadcast IPIs when the architecture
> supports it. Or maybe I'm missing something ?

It depends on what "constant overhead" means. kmalloc might cause
page reclaim and non-deterministic delay. I'm not sure (1) how much slower
membarrier_retry() is than smp_call_function_many(), and (2) whether you
think average or worst-case performance matters more. I only note that I
don't think GFP_KERNEL is constant overhead.

hmm...
Do you intend to use GFP_ATOMIC?



> >
> > #ifdef CONFIG_CPUMASK_OFFSTACK
> > membarrier_retry();
> > goto unlock;
> > #endif
> >
> > is better? I'm not sure.
>
> Thanks for the comments !
>
> Mathieu
>

2010-01-13 05:00:28

by Nicholas Miell

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> + * sys_membarrier - issue memory barrier on current process running threads
> + * @expedited: (0) Lowest overhead. Few milliseconds latency.
> + * (1) Few microseconds latency.
> + *

Alternate ABI proposal, keeping the possibility of future expansion in
mind:

/*
Mandatory flags to the membarrier system call that the kernel must
understand are in the high 16 bits.
*/
#define MEMBARRIER_MANDATORY_MASK 0xFFFF0000

/*
Optional hints that the kernel can ignore are in the low 16 bits.
*/
#define MEMBARRIER_OPTIONAL_MASK 0x0000FFFF

#define MEMBARRIER_EXPEDITED 1

extern int membarrier(unsigned int flags);

And then add to the system call itself:

	if ((flags & MEMBARRIER_MANDATORY_MASK) != 0)
		return -EINVAL;
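
For illustration, a caller under this proposed ABI could look like the sketch
below. Note that MEMBARRIER_EXPEDITED sits in the optional half of the flag
space, so under this proposal a kernel may silently ignore it and fall back to
the slower path; only unknown bits in MEMBARRIER_MANDATORY_MASK make the call
fail with -EINVAL. (The error handling shown is illustrative only.)

	if (membarrier(MEMBARRIER_EXPEDITED) == -1) {
		if (errno == ENOSYS) {
			/* kernel without the system call at all:
			   use an application-provided fallback */
		} else if (errno == EINVAL) {
			/* kernel rejected a mandatory flag we passed */
		}
	}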


--
Nicholas Miell <[email protected]>

2010-01-13 05:31:31

by Paul E. McKenney

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, Jan 12, 2010 at 09:00:23PM -0800, Nicholas Miell wrote:
> On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + * @expedited: (0) Lowest overhead. Few milliseconds latency.
> > + * (1) Few microseconds latency.
> > + *
>
> Alternate ABI proposal, keeping the possibility of future expansion in
> mind:
>
> /*
> Mandatory flags to the membarrier system call that the kernel must
> understand are in the high 16 bits.
> */
> #define MEMBARRIER_MANDATORY_MASK 0xFFFF0000
>
> /*
> Optional hints that the kernel can ignore are in the low 16 bits.
> */
> #define MEMBARRIER_OPTIONAL_MASK 0x0000FFFF
>
> #define MEMBARRIER_EXPEDITED 1
>
> extern int membarrier(unsigned int flags);
>
> And then add to the system call itself:
>
> if ((flags & MEMBARRIER_MANDATORY_MASK) != 0)
> return -EINVAL;

Why is it OK to ignore the developer's request for an expedited
membarrier()? The guy who expected the syscall to complete in a few
microseconds might not be so happy to have it take many milliseconds.
By the same token, the guy who specified non-expedited so as to minimally
disturb other threads in the system might not be so happy to see them
all be IPIed for no good reason. ;-)

Thanx, Paul

2010-01-13 05:33:20

by Paul E. McKenney

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Wed, Jan 13, 2010 at 01:47:50PM +0900, KOSAKI Motohiro wrote:
> > * KOSAKI Motohiro ([email protected]) wrote:
> > > Hi
> > >
> > > Interesting patch :)
> > >
> > > I have few comments.
> > >
> > > > Index: linux-2.6-lttng/kernel/sched.c
> > > > ===================================================================
> > > > --- linux-2.6-lttng.orig/kernel/sched.c 2010-01-12 10:25:47.000000000 -0500
> > > > +++ linux-2.6-lttng/kernel/sched.c 2010-01-12 14:33:20.000000000 -0500
> > > > @@ -10822,6 +10822,117 @@ struct cgroup_subsys cpuacct_subsys = {
> > > > };
> > > > #endif /* CONFIG_CGROUP_CPUACCT */
> > > >
> > > > +#ifdef CONFIG_SMP
> > > > +
> > > > +/*
> > > > + * Execute a memory barrier on all active threads from the current process
> > > > + * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
> > > > + * because batched IPI lists are synchronized with spinlocks rather than full
> > > > + * memory barriers. This is not the bulk of the overhead anyway, so let's stay
> > > > + * on the safe side.
> > > > + */
> > > > +static void membarrier_ipi(void *unused)
> > > > +{
> > > > + smp_mb();
> > > > +}
> > > > +
> > > > +/*
> > > > + * Handle out-of-mem by sending per-cpu IPIs instead.
> > > > + */
> > > > +static void membarrier_retry(void)
> > > > +{
> > > > + struct mm_struct *mm;
> > > > + int cpu;
> > > > +
> > > > + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > > > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > > > + mm = cpu_curr(cpu)->mm;
> > > > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > > + if (current->mm == mm)
> > > > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > > > + }
> > > > +}
> > > > +
> > > > +#endif /* #ifdef CONFIG_SMP */
> > > > +
> > > > +/*
> > > > + * sys_membarrier - issue memory barrier on current process running threads
> > > > + * @expedited: (0) Lowest overhead. Few milliseconds latency.
> > > > + * (1) Few microseconds latency.
> > >
> > > Why do we need both expedited and non-expedited mode? at least, this documentation
> > > is bad. it suggest "you have to use non-expedited mode always!".
> >
> > Right. Maybe I should rather write:
> >
> > + * @expedited: (0) Low overhead, but slow execution (few milliseconds)
> > + * (1) Slightly higher overhead, fast execution (few microseconds)
> >
> > And I could probably go as far as adding a few paragraphs:
> >
> > Using the non-expedited mode is recommended for applications which can
> > afford leaving the caller thread waiting for a few milliseconds. A good
> > example would be a thread dedicated to execute RCU callbacks, which
> > waits for callbacks to enqueue most of the time anyway.
> >
> > The expedited mode is recommended whenever the application needs to have
> > control returning to the caller thread as quickly as possible. An
> > example of such application would be one which uses the same thread to
> > perform data structure updates and issue the RCU synchronization.
> >
> > It is perfectly safe to call both expedited and non-expedited
> > sys_membarriers in a process.
> >
> >
> > Does that help ?
>
> Do librcu need both? I bet average programmer don't understand this
> explanation. please recall, syscall interface are used by non kernel
> developers too. If librcu only use either (0) or (1), I hope remove
> another one.

I believe that user-mode RCU will need both, and for much the same
reasons that kernel-mode RCU now has both expedited and non-expedited
grace periods.

Thanx, Paul

> But if librcu really need both, the above explanation is enough good.
> I think.
>
>
> > > > + * Memory barrier on the caller thread _before_ sending first
> > > > + * IPI. Matches memory barriers around mm_cpumask modification in
> > > > + * switch_mm().
> > > > + */
> > > > + smp_mb();
> > > > + if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > > > + membarrier_retry();
> > > > + goto unlock;
> > > > + }
> > >
> > > if CONFIG_CPUMASK_OFFSTACK=1, alloc_cpumask_var call kmalloc. FWIW,
> > > kmalloc calling seems destory the worth of this patch.
> >
> > Why ? I'm not sure I understand your point. Even if we call kmalloc to
> > allocate the cpumask, this is a constant overhead. The benefit of
> > smp_call_function_many() over smp_call_function_single() is that it
> > scales better by allowing to broadcast IPIs when the architecture
> > supports it. Or maybe I'm missing something ?
>
> It depend on what mean "constant overhead". kmalloc might cause
> page reclaim and undeterministic delay. I'm not sure (1) How much
> membarrier_retry() slower than smp_call_function_many and (2) Which do
> you think important average or worst performance. Only I note I don't
> think GFP_KERNEL is constant overhead.
>
> hmm...
> Do you intend to GFP_ATOMIC?
>
>
>
> > >
> > > #ifdef CONFIG_CPUMASK_OFFSTACK
> > > membarrier_retry();
> > > goto unlock;
> > > #endif
> > >
> > > is better? I'm not sure.
> >
> > Thanks for the comments !
> >
> > Mathieu
> >
>
>

2010-01-13 05:40:10

by Nicholas Miell

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-12 at 21:31 -0800, Paul E. McKenney wrote:
> Why is it OK to ignore the developer's request for an expedited
> membarrer()? The guy who expected the syscall to complete in a few
> microseconds might not be so happy to have it take many milliseconds.
> By the same token, the guy who specified non-expedited so as to minimally
> disturb other threads in the system might not be so happy to see them
> all be IPIed for no good reason. ;-)
>
> Thanx, Paul

Because the behavior is still correct, even if it is slower than you'd
expect. It doesn't really matter where the expedited flag goes, though,
because every future kernel will understand it.

--
Nicholas Miell <[email protected]>

2010-01-13 11:07:29

by Heiko Carstens

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, Jan 12, 2010 at 08:37:57PM -0500, Mathieu Desnoyers wrote:
> +static void membarrier_retry(void)
> +{
> + struct mm_struct *mm;
> + int cpu;
> +
> + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> + spin_lock_irq(&cpu_rq(cpu)->lock);
> + mm = cpu_curr(cpu)->mm;
> + spin_unlock_irq(&cpu_rq(cpu)->lock);
> + if (current->mm == mm)
> + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> + }

You would need to disable cpu unplug operations while doing this
or you might end up sending IPIs to offline cpus.

> + cpumask_copy(tmpmask, mm_cpumask(current->mm));
> + preempt_disable();
> + cpumask_clear_cpu(smp_processor_id(), tmpmask);
> + for_each_cpu(cpu, tmpmask) {
> + spin_lock_irq(&cpu_rq(cpu)->lock);
> + mm = cpu_curr(cpu)->mm;

This might access the rq of an offline cpu.
But maybe it's intended since offline cpus "run" idle?

2010-01-13 14:38:11

by Mathieu Desnoyers

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Nicholas Miell ([email protected]) wrote:
> On Tue, 2010-01-12 at 21:31 -0800, Paul E. McKenney wrote:
> > Why is it OK to ignore the developer's request for an expedited
> > membarrer()? The guy who expected the syscall to complete in a few
> > microseconds might not be so happy to have it take many milliseconds.
> > By the same token, the guy who specified non-expedited so as to minimally
> > disturb other threads in the system might not be so happy to see them
> > all be IPIed for no good reason. ;-)
> >
> > Thanx, Paul
>
> Because the behavior is still correct, even if it is slower than you'd
> expect. It doesn't really matter where the expedited flag goes, though,
> because every future kernel will understand it.

16ms vs a few µs is such a huge performance difference that it's barely
adequate to say that the behavior is still correct, but we definitely
cannot say it is unchanged. It can really render some applications
unusable.

If, for some reason, the expedited version of the system call happens to
be unimplemented, we should return -EINVAL, so the application can deal
with it in the appropriate way (which could be, for instance, to use a
fall-back doing memory barriers on the RCU read-side).

But I don't see any reason for not implementing the expedited version
properly in the first place.
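
(For illustration only, a minimal sketch of the kind of fall-back meant here;
the names are made up and this is not the actual liburcu implementation:)

extern int membarrier(int expedited);	/* wrapper around sys_membarrier() */
#define barrier()	__asm__ __volatile__("" : : : "memory")

/* Probe once at library init whether the expedited mode is implemented. */
static int has_expedited_membarrier;

static void example_membarrier_init(void)
{
	has_expedited_membarrier = (membarrier(1) == 0);
}

/* Read-side barrier used by rcu_read_lock()/rcu_read_unlock(). */
static inline void example_smp_mb_reader(void)
{
	if (has_expedited_membarrier)
		barrier();		/* upgraded by the writer's membarrier(1) */
	else
		__sync_synchronize();	/* full memory barrier fallback */
}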

Thanks,

Mathieu

>
> --
> Nicholas Miell <[email protected]>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 14:47:01

by Mathieu Desnoyers

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Heiko Carstens ([email protected]) wrote:
> On Tue, Jan 12, 2010 at 08:37:57PM -0500, Mathieu Desnoyers wrote:
> > +static void membarrier_retry(void)
> > +{
> > + struct mm_struct *mm;
> > + int cpu;
> > +
> > + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > + mm = cpu_curr(cpu)->mm;
> > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > + if (current->mm == mm)
> > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > + }
>
> You would need to disable cpu unplug operations while doing this
> or you might end up sending IPIs to offline cpus.

smp_call_function_single() checks for cpu_online(cpu), and returns
-ENXIO if the cpu is not online. So I think disabling cpu hotplug would
be redundant with this test.

smp_call_function_many uses cpumask_next_and(cpu, mask, cpu_online_mask)
to alter the mask, so no cpu hotplug disabling is needed there either.

These checks are protected by preemption disabling.

>
> > + cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > + preempt_disable();
> > + cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > + for_each_cpu(cpu, tmpmask) {
> > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > + mm = cpu_curr(cpu)->mm;
>
> This might access the rq of an offline cpu.
> But maybe it's intended since offline cpus "run" idle?

In a rare race with hotunplug vs lazy TLB shootdown, yes. Although even
then, as you point out, ->mm will be NULL, so we won't even consider the
CPU for IPI. In any case, I think adding a cpumask online "and" would be
an added performance overhead for the common case compared to the
performance gain in the rare cpu hotunplug race window.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 15:03:27

by Mathieu Desnoyers

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* KOSAKI Motohiro ([email protected]) wrote:
> > * KOSAKI Motohiro ([email protected]) wrote:
[...]
> > > Why do we need both expedited and non-expedited mode? at least, this documentation
> > > is bad. it suggest "you have to use non-expedited mode always!".
> >
> > Right. Maybe I should rather write:
> >
> > + * @expedited: (0) Low overhead, but slow execution (few milliseconds)
> > + * (1) Slightly higher overhead, fast execution (few microseconds)
> >
> > And I could probably go as far as adding a few paragraphs:
> >
> > Using the non-expedited mode is recommended for applications which can
> > afford leaving the caller thread waiting for a few milliseconds. A good
> > example would be a thread dedicated to execute RCU callbacks, which
> > waits for callbacks to enqueue most of the time anyway.
> >
> > The expedited mode is recommended whenever the application needs to have
> > control returning to the caller thread as quickly as possible. An
> > example of such application would be one which uses the same thread to
> > perform data structure updates and issue the RCU synchronization.
> >
> > It is perfectly safe to call both expedited and non-expedited
> > sys_membarriers in a process.
> >
> >
> > Does that help ?
>
> Do librcu need both? I bet average programmer don't understand this
> explanation. please recall, syscall interface are used by non kernel
> developers too. If librcu only use either (0) or (1), I hope remove
> another one.
>
> But if librcu really need both, the above explanation is enough good.
> I think.

As Paul said, we need both in liburcu. These usage scenarios are
explained in the system call documentation.

>
>
> > > > + * Memory barrier on the caller thread _before_ sending first
> > > > + * IPI. Matches memory barriers around mm_cpumask modification in
> > > > + * switch_mm().
> > > > + */
> > > > + smp_mb();
> > > > + if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > > > + membarrier_retry();
> > > > + goto unlock;
> > > > + }
> > >
> > > if CONFIG_CPUMASK_OFFSTACK=1, alloc_cpumask_var call kmalloc. FWIW,
> > > kmalloc calling seems destory the worth of this patch.
> >
> > Why ? I'm not sure I understand your point. Even if we call kmalloc to
> > allocate the cpumask, this is a constant overhead. The benefit of
> > smp_call_function_many() over smp_call_function_single() is that it
> > scales better by allowing to broadcast IPIs when the architecture
> > supports it. Or maybe I'm missing something ?
>
> It depend on what mean "constant overhead". kmalloc might cause
> page reclaim and undeterministic delay. I'm not sure (1) How much
> membarrier_retry() slower than smp_call_function_many and (2) Which do
> you think important average or worst performance. Only I note I don't
> think GFP_KERNEL is constant overhead.

10,000,000 sys_membarrier calls (varying the number of threads to which
we send IPIs), IPI-to-many, 8-core system:

T=1: 0m20.173s
T=2: 0m20.506s
T=3: 0m22.632s
T=4: 0m24.759s
T=5: 0m26.633s
T=6: 0m29.654s
T=7: 0m30.669s

Just doing local mb()+single IPI to T other threads:

T=1: 0m18.801s
T=2: 0m29.086s
T=3: 0m46.841s
T=4: 0m53.758s
T=5: 1m10.856s
T=6: 1m21.142s
T=7: 1m38.362s

So sending single IPIs adds about 1.5 microseconds per extra core. With
the IPI-to-many scheme, we add about 0.2 microseconds per extra core. So
we have a factor 10 gain in scalability. The initial cost of the cpumask
allocation (which seems to be allocated on the stack in my config) is
just about 1.4 microseconds. So here, we only have a small gain for the
1 IPI case, which does not justify the added complexity of dealing with
it differently.

Also... it's pretty much a slow path anyway compared to the RCU
read-side. I just don't want this slow path to scale badly.

>
> hmm...
> Do you intend to GFP_ATOMIC?

Would it help to lower the allocation overhead ?

Thanks,

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 15:58:56

by Paul E. McKenney

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, Jan 12, 2010 at 09:39:56PM -0800, Nicholas Miell wrote:
> On Tue, 2010-01-12 at 21:31 -0800, Paul E. McKenney wrote:
> > Why is it OK to ignore the developer's request for an expedited
> > membarrer()? The guy who expected the syscall to complete in a few
> > microseconds might not be so happy to have it take many milliseconds.
> > By the same token, the guy who specified non-expedited so as to minimally
> > disturb other threads in the system might not be so happy to see them
> > all be IPIed for no good reason. ;-)
> >
> > Thanx, Paul
>
> Because the behavior is still correct, even if it is slower than you'd
> expect. It doesn't really matter where the expedited flag goes, though,
> because every future kernel will understand it.

In a real-time application, no shortage of which run on Linux, "going
slower than you expect" is a bug, right?

Thanx, Paul

2010-01-13 16:40:09

by Peter Zijlstra

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> + for_each_cpu(cpu, tmpmask) {
> + spin_lock_irq(&cpu_rq(cpu)->lock);
> + mm = cpu_curr(cpu)->mm;
> + spin_unlock_irq(&cpu_rq(cpu)->lock);
> + if (current->mm != mm)
> + cpumask_clear_cpu(cpu, tmpmask);
> + }

Why not:

	rcu_read_lock();
	if (current->mm != cpu_curr(cpu)->mm)
		cpumask_clear_cpu(cpu, tmpmask);
	rcu_read_unlock();

the RCU read lock ensures the task_struct obtained remains valid, and it
avoids taking the rq->lock.

2010-01-13 18:13:55

by Nicholas Miell

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Wed, 2010-01-13 at 09:38 -0500, Mathieu Desnoyers wrote:
> * Nicholas Miell ([email protected]) wrote:
> > On Tue, 2010-01-12 at 21:31 -0800, Paul E. McKenney wrote:
> > > Why is it OK to ignore the developer's request for an expedited
> > > membarrer()? The guy who expected the syscall to complete in a few
> > > microseconds might not be so happy to have it take many milliseconds.
> > > By the same token, the guy who specified non-expedited so as to minimally
> > > disturb other threads in the system might not be so happy to see them
> > > all be IPIed for no good reason. ;-)
> > >
> > > Thanx, Paul
> >
> > Because the behavior is still correct, even if it is slower than you'd
> > expect. It doesn't really matter where the expedited flag goes, though,
> > because every future kernel will understand it.
>
> 16ms vs few µs is such a huge performance difference that it's barely
> adequate to say that the behavior is still correct, but we definitely
> cannot say it is unchanged. It can really render some applications
> unusable.
>
> If, for some reason, the expedited version of the system call happens to
> be unimplemented, we should return -EINVAL, so the application can deal
> with it in the approproate way (which could be, for instance, to use a
> fall-back doing memory barriers on the RCU read-side).
>
> But I don't see any reason for not implementing the expedited version
> properly in the first place.
>

You're focusing on the least important detail of the proposal. It
doesn't matter whether the expedited flag is a compat flag or an incompat
flag, because every version of the kernel that ever has the membarrier
system call will know what it means, and even if for some reason it
doesn't implement expedited membarriers, it will still be able to
recognize the flag and return an error.

The whole point of compat and incompat flags is that it allows new
applications to run on old kernels and either work or fail as
appropriate, depending on whether the new features they're using must be
implemented or can be silently ignored.

--
Nicholas Miell <[email protected]>

2010-01-13 18:24:45

by Mathieu Desnoyers

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Nicholas Miell ([email protected]) wrote:
> On Wed, 2010-01-13 at 09:38 -0500, Mathieu Desnoyers wrote:
> > * Nicholas Miell ([email protected]) wrote:
> > > On Tue, 2010-01-12 at 21:31 -0800, Paul E. McKenney wrote:
> > > > Why is it OK to ignore the developer's request for an expedited
> > > > membarrer()? The guy who expected the syscall to complete in a few
> > > > microseconds might not be so happy to have it take many milliseconds.
> > > > By the same token, the guy who specified non-expedited so as to minimally
> > > > disturb other threads in the system might not be so happy to see them
> > > > all be IPIed for no good reason. ;-)
> > > >
> > > > Thanx, Paul
> > >
> > > Because the behavior is still correct, even if it is slower than you'd
> > > expect. It doesn't really matter where the expedited flag goes, though,
> > > because every future kernel will understand it.
> >
> > 16ms vs a few µs is such a huge performance difference that it's barely
> > adequate to say that the behavior is still correct, but we definitely
> > cannot say it is unchanged. It can really render some applications
> > unusable.
> >
> > If, for some reason, the expedited version of the system call happens to
> > be unimplemented, we should return -EINVAL, so the application can deal
> > with it in the approproate way (which could be, for instance, to use a
> > fall-back doing memory barriers on the RCU read-side).
> >
> > But I don't see any reason for not implementing the expedited version
> > properly in the first place.
> >
>
> You're focusing on the least important detail of the proposal. It
> doesn't matter whether the expedited flag is compat flag or an incompat
> flag, because every version of kernel that ever has the membarrier
> system call will know what it means, and even if for some reason it
> doesn't implement expedited membarriers, it will still be able to
> recognize the flag and return an error.
>
> The whole point of compat and incompat flags is that it allows new
> applications to run on old kernels and either work or fail as
> appropriate, depending on whether the new features they're using must be
> implemented or can be silently ignored.

I see. Thanks for the explanation. Then the expedited flag should
clearly be part of the mandatory flags.

Can you point me to other system calls that are doing this ?

Thanks,

Mathieu

>
> --
> Nicholas Miell <[email protected]>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 18:41:46

by Nicholas Miell

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Wed, 2010-01-13 at 13:24 -0500, Mathieu Desnoyers wrote:
> * Nicholas Miell ([email protected]) wrote:
>
> > The whole point of compat and incompat flags is that it allows new
> > applications to run on old kernels and either work or fail as
> > appropriate, depending on whether the new features they're using must be
> > implemented or can be silently ignored.
>
> I see. Thanks for the explanation. Then the expedited flag should
> clearly be part of the mandatory flags.
>
> Can you point me to other system calls that are doing this ?
>
> Thanks,
>
> Mathieu

Not off the top of my head, but I did steal the idea from the ext2/3/4
disk format.

--
Nicholas Miell <[email protected]>

2010-01-13 19:17:32

by Mathieu Desnoyers

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Nicholas Miell ([email protected]) wrote:
> On Wed, 2010-01-13 at 13:24 -0500, Mathieu Desnoyers wrote:
> > * Nicholas Miell ([email protected]) wrote:
> >
> > > The whole point of compat and incompat flags is that it allows new
> > > applications to run on old kernels and either work or fail as
> > > appropriate, depending on whether the new features they're using must be
> > > implemented or can be silently ignored.
> >
> > I see. Thanks for the explanation. Then the expedited flag should
> > clearly be part of the mandatory flags.
> >
> > Can you point me to other system calls that are doing this ?
> >
> > Thanks,
> >
> > Mathieu
>
> Not off the top of my head, but I did steal the idea from the ext2/3/4
> disk format.

Sounds a bit over-engineered to me for system calls, but who knows if we
eventually have to extend sys_membarrier(). This implies that, right
now, I'd have to add a header under include/linux to define these flags.
Also, "int expedited" is a bit clearer, but less flexible, than "int
flags". Anyone else have comments about this ?

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 19:36:08

by Mathieu Desnoyers

Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Peter Zijlstra ([email protected]) wrote:
> On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> > + for_each_cpu(cpu, tmpmask) {
> > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > + mm = cpu_curr(cpu)->mm;
> > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > + if (current->mm != mm)
> > + cpumask_clear_cpu(cpu, tmpmask);
> > + }
>
> Why not:
>
> rcu_read_lock();
> if (current->mm != cpu_curr(cpu)->mm)
> cpumask_clear_cpu(cpu, tmpmask);
> rcu_read_unlock();
>
> the RCU read lock ensures the task_struct obtained remains valid, and it
> avoids taking the rq->lock.
>

If we go for a simple rcu_read_lock, I think that we need a smp_mb()
after switch_to() updates the current task on the remote CPU, before it
returns to user-space. Do we have this guarantee for all architectures ?

So what I'm looking for, overall, is:

schedule()
...
switch_mm()
smp_mb()
clear mm_cpumask
set mm_cpumask
switch_to()
update current task
smp_mb()

If we have that, then the rcu_read_lock should work.

What the rq lock currently gives us is the guarantee that if the current
thread changes on a remote CPU while we are not holding this lock, then
a full scheduler execution is performed, which implies a memory barrier
if we change the current thread (it does, right ?).

Thanks,

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 19:42:16

by David Daney

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

Mathieu Desnoyers wrote:
> * Nicholas Miell ([email protected]) wrote:
>> On Wed, 2010-01-13 at 13:24 -0500, Mathieu Desnoyers wrote:
>>> * Nicholas Miell ([email protected]) wrote:
>>>
>>>> The whole point of compat and incompat flags is that it allows new
>>>> applications to run on old kernels and either work or fail as
>>>> appropriate, depending on whether the new features they're using must be
>>>> implemented or can be silently ignored.
>>> I see. Thanks for the explanation. Then the expedited flag should
>>> clearly be part of the mandatory flags.
>>>
>>> Can you point me to other system calls that are doing this ?
>>>
>>> Thanks,
>>>
>>> Mathieu
>> Not off the top of my head, but I did steal the idea from the ext2/3/4
>> disk format.
>
> Sounds a bit over-engineered to me for system calls, but who knows if we
> eventually have to extend sys_membarrier(). This involves that, right
> now, I'd have to add a header to include/linux to define these flags.
> Also, "int expedited" is a bit clearer, but less flexible, than "int
> flags". Anyone else have comments about this ?
>

It doesn't bother me that you have to do extra work to add the flag
definitions to a header file. :-)

As I understand it, the proposal is to have the option to extend the ABI
based on as yet undefined flag bits. This doesn't seem like a bad thing.

The runtime overhead of testing a single bit vs. non-zero in the
parameter shouldn't be an issue.

David Daney

2010-01-13 19:53:33

by Nicholas Miell

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Wed, 2010-01-13 at 11:42 -0800, David Daney wrote:
> Mathieu Desnoyers wrote:
> > * Nicholas Miell ([email protected]) wrote:
> >> On Wed, 2010-01-13 at 13:24 -0500, Mathieu Desnoyers wrote:
> >>> * Nicholas Miell ([email protected]) wrote:
> >>>
> >>>> The whole point of compat and incompat flags is that it allows new
> >>>> applications to run on old kernels and either work or fail as
> >>>> appropriate, depending on whether the new features they're using must be
> >>>> implemented or can be silently ignored.
> >>> I see. Thanks for the explanation. Then the expedited flag should
> >>> clearly be part of the mandatory flags.
> >>>
> >>> Can you point me to other system calls that are doing this ?
> >>>
> >>> Thanks,
> >>>
> >>> Mathieu
> >> Not off the top of my head, but I did steal the idea from the ext2/3/4
> >> disk format.
> >
> > Sounds a bit over-engineered to me for system calls, but who knows if we
> > eventually have to extend sys_membarrier(). This involves that, right
> > now, I'd have to add a header to include/linux to define these flags.
> > Also, "int expedited" is a bit clearer, but less flexible, than "int
> > flags". Anyone else have comments about this ?
> >
>
> It doesn't bother me that you have to do extra work to add the flag
> definitions to a header file. :-)
>
> As I understand it, the proposal is to have the option to extend the ABI
> based on as yet undefined flag bits. This doesn't seem like a bad thing.
>
> The runtime overhead of testing a single bit vs. non-zero in the
> parameter shouldn't be an issue.
>

The recent introduction of accept4(), signalfd4(), eventfd2(),
epoll_create1(), dup3(), pipe2(), and inotify_init1() suggest that this
is the kind of thing you want to plan for, because you're probably going
to end up doing it anyway.

--
Nicholas Miell <[email protected]>

2010-01-13 23:42:48

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Nicholas Miell ([email protected]) wrote:
> On Wed, 2010-01-13 at 11:42 -0800, David Daney wrote:
> > Mathieu Desnoyers wrote:
> > > * Nicholas Miell ([email protected]) wrote:
> > >> On Wed, 2010-01-13 at 13:24 -0500, Mathieu Desnoyers wrote:
> > >>> * Nicholas Miell ([email protected]) wrote:
> > >>>
> > >>>> The whole point of compat and incompat flags is that it allows new
> > >>>> applications to run on old kernels and either work or fail as
> > >>>> appropriate, depending on whether the new features they're using must be
> > >>>> implemented or can be silently ignored.
> > >>> I see. Thanks for the explanation. Then the expedited flag should
> > >>> clearly be part of the mandatory flags.
> > >>>
> > >>> Can you point me to other system calls that are doing this ?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Mathieu
> > >> Not off the top of my head, but I did steal the idea from the ext2/3/4
> > >> disk format.
> > >
> > > Sounds a bit over-engineered to me for system calls, but who knows if we
> > > eventually have to extend sys_membarrier(). This involves that, right
> > > now, I'd have to add a header to include/linux to define these flags.
> > > Also, "int expedited" is a bit clearer, but less flexible, than "int
> > > flags". Anyone else have comments about this ?
> > >
> >
> > It doesn't bother me that you have to do extra work to add the flag
> > definitions to a header file. :-)
> >

Work is not a problem. I just wanted to make sure I would not be going
in circles. :)

> > As I understand it, the proposal is to have the option to extend the ABI
> > based on as yet undefined flag bits. This doesn't seem like a bad thing.
> >
> > The runtime overhead of testing a single bit vs. non-zero in the
> > parameter shouldn't be an issue.
> >
>
> The recent introduction of accept4(), signalfd4(), eventfd2(),
> epoll_create1(), dup3(), pipe2(), and inotify_init1() suggest that this
> is the kind of thing you want to plan for, because you're probably going
> to end up doing it anyway.

Yes, that's a very convincing argument. OK, I'll cook something for v6.
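
Roughly what I have in mind for the interface, as a quick untested sketch
(the flag name, its value and the helper are placeholders, not a final ABI):

/*
 * All flags defined so far would be "mandatory": an unknown bit set by the
 * application makes the system call fail with -EINVAL instead of being
 * silently ignored, so a new application can detect an older kernel.
 */
#define MEMBARRIER_EXPEDITED	(1 << 0)	/* placeholder flag */
#define MEMBARRIER_KNOWN_FLAGS	(MEMBARRIER_EXPEDITED)

SYSCALL_DEFINE1(membarrier, int, flags)
{
        if (flags & ~MEMBARRIER_KNOWN_FLAGS)
                return -EINVAL;         /* reject unknown flags, pipe2()-style */

        if (unlikely(thread_group_empty(current)) || num_online_cpus() == 1)
                return 0;               /* single thread or single CPU: nothing to do */

        if (flags & MEMBARRIER_EXPEDITED)
                membarrier_expedited(); /* placeholder for the smp_mb() + IPI path */
        else
                synchronize_sched();    /* non-expedited path, as in v5 */
        return 0;
}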

Thanks a lot,

Mathieu

>
> --
> Nicholas Miell <[email protected]>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 00:15:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

> * KOSAKI Motohiro ([email protected]) wrote:
> > > * KOSAKI Motohiro ([email protected]) wrote:
> [...]
> > > > Why do we need both expedited and non-expedited mode? at least, this documentation
> > > > is bad. it suggest "you have to use non-expedited mode always!".
> > >
> > > Right. Maybe I should rather write:
> > >
> > > + * @expedited: (0) Low overhead, but slow execution (few milliseconds)
> > > + * (1) Slightly higher overhead, fast execution (few microseconds)
> > >
> > > And I could probably go as far as adding a few paragraphs:
> > >
> > > Using the non-expedited mode is recommended for applications which can
> > > afford leaving the caller thread waiting for a few milliseconds. A good
> > > example would be a thread dedicated to execute RCU callbacks, which
> > > waits for callbacks to enqueue most of the time anyway.
> > >
> > > The expedited mode is recommended whenever the application needs to have
> > > control returning to the caller thread as quickly as possible. An
> > > example of such application would be one which uses the same thread to
> > > perform data structure updates and issue the RCU synchronization.
> > >
> > > It is perfectly safe to call both expedited and non-expedited
> > > sys_membarriers in a process.
> > >
> > >
> > > Does that help ?
> >
> > Do librcu need both? I bet average programmer don't understand this
> > explanation. please recall, syscall interface are used by non kernel
> > developers too. If librcu only use either (0) or (1), I hope remove
> > another one.
> >
> > But if librcu really need both, the above explanation is enough good.
> > I think.
>
> As Paul said, we need both in liburcu. These usage scenarios are
> explained in the system call documentation.

ok. thanks.


> > > > > + * Memory barrier on the caller thread _before_ sending first
> > > > > + * IPI. Matches memory barriers around mm_cpumask modification in
> > > > > + * switch_mm().
> > > > > + */
> > > > > + smp_mb();
> > > > > + if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> > > > > + membarrier_retry();
> > > > > + goto unlock;
> > > > > + }
> > > >
> > > > if CONFIG_CPUMASK_OFFSTACK=1, alloc_cpumask_var call kmalloc. FWIW,
> > > > kmalloc calling seems destory the worth of this patch.
> > >
> > > Why ? I'm not sure I understand your point. Even if we call kmalloc to
> > > allocate the cpumask, this is a constant overhead. The benefit of
> > > smp_call_function_many() over smp_call_function_single() is that it
> > > scales better by allowing to broadcast IPIs when the architecture
> > > supports it. Or maybe I'm missing something ?
> >
> > It depend on what mean "constant overhead". kmalloc might cause
> > page reclaim and undeterministic delay. I'm not sure (1) How much
> > membarrier_retry() slower than smp_call_function_many and (2) Which do
> > you think important average or worst performance. Only I note I don't
> > think GFP_KERNEL is constant overhead.
>
> 10,000,000 sys_membarrier calls (varying the number of threads to which
> we send IPIs), IPI-to-many, 8-core system:
>
> T=1: 0m20.173s
> T=2: 0m20.506s
> T=3: 0m22.632s
> T=4: 0m24.759s
> T=5: 0m26.633s
> T=6: 0m29.654s
> T=7: 0m30.669s
>
> Just doing local mb()+single IPI to T other threads:
>
> T=1: 0m18.801s
> T=2: 0m29.086s
> T=3: 0m46.841s
> T=4: 0m53.758s
> T=5: 1m10.856s
> T=6: 1m21.142s
> T=7: 1m38.362s
>
> So sending single IPIs adds about 1.5 microseconds per extra core. With
> the IPI-to-many scheme, we add about 0.2 microseconds per extra core. So
> we have a factor 10 gain in scalability. The initial cost of the cpumask
> allocation (which seems to be allocated on the stack in my config) is
> just about 1.4 microseconds. So here, we only have a small gain for the
> 1 IPI case, which does not justify the added complexity of dealing with
> it differently.

I'd like to discuss CONFIG_CPUMASK_OFFSTACK=1 and CONFIG_CPUMASK_OFFSTACK=0 separately.

CONFIG_CPUMASK_OFFSTACK=0 (your config)
- the cpumask is allocated on the stack
- alloc_cpumask_var() is a nop (yes, a nop is constant overhead ;)
- alloc_cpumask_var() always returns 1, so membarrier_retry() is never called.
- alloc_cpumask_var() ignores the GFP_KERNEL parameter

CONFIG_CPUMASK_OFFSTACK=1 with GFP_KERNEL
- the cpumask is allocated on the heap
- alloc_cpumask_var() is a wrapper around kmalloc()
- the GFP_KERNEL parameter is passed to kmalloc()
- GFP_KERNEL means alloc_cpumask_var() always returns 1, except in the
  oom-killer case. IOW, membarrier_retry() is still never called in the
  typical use case.
- kmalloc(GFP_KERNEL) might invoke page reclaim, which can take a few
  seconds (not microseconds).

CONFIG_CPUMASK_OFFSTACK=1 with GFP_ATOMIC
- the cpumask is allocated on the heap
- alloc_cpumask_var() is a wrapper around kmalloc()
- GFP_ATOMIC means kmalloc() never invokes page reclaim. IOW, the
  kmalloc() cost is nearly constant (from a few to many microseconds).
- OTOH, alloc_cpumask_var() might fail, and in that case membarrier_retry()
  is called.

So, my last mail talked about CONFIG_CPUMASK_OFFSTACK=1, but you measured
CONFIG_CPUMASK_OFFSTACK=0. That's the reason why our conclusions differ.
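
To make these three cases concrete, alloc_cpumask_var() is roughly the
following (paraphrased from include/linux/cpumask.h, not the exact source):

#ifdef CONFIG_CPUMASK_OFFSTACK
/* cpumask_var_t is a pointer; the allocation really goes through kmalloc().
 * GFP_KERNEL may sleep in page reclaim, GFP_ATOMIC may fail, and a failure
 * is what would trigger the membarrier_retry() fallback. */
typedef struct cpumask *cpumask_var_t;

static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
        *mask = kmalloc(cpumask_size(), flags);
        return *mask != NULL;
}
#else
/* cpumask_var_t lives on the stack; the "allocation" is a nop that always
 * succeeds and ignores the gfp flags completely. */
typedef struct cpumask cpumask_var_t[1];

static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
        return true;
}
#endif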

>
> Also... it's pretty much a slow path anyway compared to the RCU
> read-side. I just don't want this slow path to scale badly.
>
> >
> > hmm...
> > Do you intend to GFP_ATOMIC?
>
> Would it help to lower the allocation overhead ?

No. If the system has lots of free memory, GFP_ATOMIC and GFP_KERNEL
make no difference. But if the system has no free memory,
GFP_KERNEL might cause big latencies.


Perhaps it is no big issue. If the system has no free memory, another
syscall will invoke page reclaim soon anyway, even if sys_membarrier avoids it.
I'm not sure. It depends on the liburcu latency policy.

An alternative plan is:

        if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
                err = -ENOMEM;
                goto unlock;
        }

and kill membarrier_retry(). Because CONFIG_CPUMASK_OFFSTACK=1 is
only used on big SGI HPC machines, it means nobody can really test membarrier_retry().
A function that is never called isn't worth much.

Thought?


2010-01-14 02:16:51

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* KOSAKI Motohiro ([email protected]) wrote:
> > * KOSAKI Motohiro ([email protected]) wrote:
[...]
> > > It depend on what mean "constant overhead". kmalloc might cause
> > > page reclaim and undeterministic delay. I'm not sure (1) How much
> > > membarrier_retry() slower than smp_call_function_many and (2) Which do
> > > you think important average or worst performance. Only I note I don't
> > > think GFP_KERNEL is constant overhead.
> >
> > 10,000,000 sys_membarrier calls (varying the number of threads to which
> > we send IPIs), IPI-to-many, 8-core system:
> >
> > T=1: 0m20.173s
> > T=2: 0m20.506s
> > T=3: 0m22.632s
> > T=4: 0m24.759s
> > T=5: 0m26.633s
> > T=6: 0m29.654s
> > T=7: 0m30.669s
> >
> > Just doing local mb()+single IPI to T other threads:
> >
> > T=1: 0m18.801s
> > T=2: 0m29.086s
> > T=3: 0m46.841s
> > T=4: 0m53.758s
> > T=5: 1m10.856s
> > T=6: 1m21.142s
> > T=7: 1m38.362s
> >
> > So sending single IPIs adds about 1.5 microseconds per extra core. With
> > the IPI-to-many scheme, we add about 0.2 microseconds per extra core. So
> > we have a factor 10 gain in scalability. The initial cost of the cpumask
> > allocation (which seems to be allocated on the stack in my config) is
> > just about 1.4 microseconds. So here, we only have a small gain for the
> > 1 IPI case, which does not justify the added complexity of dealing with
> > it differently.
>
> I'd like to discuss to separate CONFIG_CPUMASK_OFFSTACK=1 and CONFIG_CPUMASK_OFFSTACK=0.
>
> CONFIG_CPUMASK_OFFSTACK=0 (your config)
> - cpumask is allocated on stask
> - alloc_cpumask_var() is nop (yes, nop is constant overhead ;)
> - alloc_cpumask_var() always return 1, then membarrier_retry() is never called.
> - alloc_cpumask_var() ignore GFP_KERNEL parameter
>
> CONFIG_CPUMASK_OFFSTACK=1 and use GFP_KERNEL
> - cpumask is allocated on heap
> - alloc_cpumask_var() is the wrapper of kmalloc()
> - GFP_KERNEL parameter is passed kmalloc
> - GFP_KERNEL mean alloc_cpumask_var() always return 1, except
> oom-killer case. IOW, membarrier_retry() is still never called
> on typical use case.
> - kmalloc(GFP_KERNEL) might invoke page reclaim and it can spent few
> seconds (not microseconds).
>
> CONFIG_CPUMASK_OFFSTACK=1 and use GFP_ATOMIC
> - cpumask is allocated on heap
> - alloc_cpumask_var() is the wrapper of kmalloc()
> - GFP_ATOMIC mean kmalloc never invoke page reclaim. IOW,
> kmalloc() cost is nearly constant. (few or lots microseconds)
> - OTOH, alloc_cpumask_var() might fail, at that time membarrier_retry()
> is called.
>
> So, My last mail talked about CONFIG_CPUMASK_OFFSTACK=1, but you mesured CONFIG_CPUMASK_OFFSTACK=0.
> That's the reason why our conclusion is different.

I would have to put my system in OOM condition anyway to measure the
page reclaim overhead. Given that sys_membarrier is not exactly a fast
path, I don't think it matters _that much_.

Hrm. Well, given the "expedited" nature of the system call, it might
come as a surprise to have to wait for page reclaim, and surprises are
not good. OTOH, I don't want to allow users to easily consume all the
GFP_ATOMIC pool. But I think it's unlikely, as we are bounded by the
number of processors which can concurrently run sys_membarrier().

>
> >
> > Also... it's pretty much a slow path anyway compared to the RCU
> > read-side. I just don't want this slow path to scale badly.
> >
> > >
> > > hmm...
> > > Do you intend to GFP_ATOMIC?
> >
> > Would it help to lower the allocation overhead ?
>
> No. If the system have lots free memory, GFP_ATOMIC and GFP_KERNEL
> don't have any difference. but if the system have no free memory,
> GFP_KERNEL might cause big latency.

Having a somewhat bounded latency is good for a synchronization
primitive, even for the slow path.

>
>
> Perhaps, It is no big issue. If the system have no free memory, another
> syscall will invoke page reclaim soon although sys_membarrier avoid it.
> I'm not sure. It depend on librcu latency policy.

I'd like to stay on the safe side. If you tell me that there is no risk
to let users exhaust GFP_ATOMIC pools prematurely, then I'll use it.

>
> Another alternative plan is,
>
> if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> err = -ENOMEM;
> goto unlock;
> }
>
> and kill membarrier_retry(). because CONFIG_CPUMASK_OFFSTACK=1 is
> only used for SGI big hpc machine, it mean nobody can test membarrier_retry().
> Never called function doesn't have lots worth.
>
> Thought?

I don't want to rely on a system call which can fail at arbitrary points
in the program to create a synchronization primitive. Currently (with
the forthcoming v6 patch), I can test if the system call exists and if
the flags are supported at library init time (by checking -ENOSYS and
-EINVAL return values). From that point on, I don't want to check error
values anymore. This means that a system call that fails on a given
kernel will _always_ fail. The same is true for the opposite. This is
why not returning -ENOMEM is important here.
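
The library-init check I have in mind looks roughly like this (sketch only;
the syscall number and the flag value are placeholders, since neither is
assigned yet):

#include <unistd.h>
#include <errno.h>
#include <sys/syscall.h>

#define __NR_membarrier_rfc	339		/* placeholder number */
#define MEMBARRIER_EXPEDITED	(1 << 0)	/* placeholder flag */

static int has_membarrier;	/* 0: none, 1: plain, 2: expedited too */

static int membarrier(int flags)
{
        return syscall(__NR_membarrier_rfc, flags);
}

static void membarrier_init(void)
{
        if (membarrier(0) == 0) {
                has_membarrier = 1;                     /* syscall present */
                if (membarrier(MEMBARRIER_EXPEDITED) == 0)
                        has_membarrier = 2;             /* flag supported */
        }
        /* A failure here (errno set to ENOSYS or EINVAL) simply selects the
         * signal-based fallback; after init, the chosen path is never
         * expected to fail, so callers need no per-call error handling. */
}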

So I would rather have one single, simple failure handler in the
kernel, even if it is not often used, than multiple subtly different
-ENOMEM error handlers at the user-space caller sites, resulting in a
predictable mess. Those error handlers would not be tested any more than
the one located in the kernel.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 02:25:43

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

> * KOSAKI Motohiro ([email protected]) wrote:
> > > * KOSAKI Motohiro ([email protected]) wrote:
> [...]
> > > > It depend on what mean "constant overhead". kmalloc might cause
> > > > page reclaim and undeterministic delay. I'm not sure (1) How much
> > > > membarrier_retry() slower than smp_call_function_many and (2) Which do
> > > > you think important average or worst performance. Only I note I don't
> > > > think GFP_KERNEL is constant overhead.
> > >
> > > 10,000,000 sys_membarrier calls (varying the number of threads to which
> > > we send IPIs), IPI-to-many, 8-core system:
> > >
> > > T=1: 0m20.173s
> > > T=2: 0m20.506s
> > > T=3: 0m22.632s
> > > T=4: 0m24.759s
> > > T=5: 0m26.633s
> > > T=6: 0m29.654s
> > > T=7: 0m30.669s
> > >
> > > Just doing local mb()+single IPI to T other threads:
> > >
> > > T=1: 0m18.801s
> > > T=2: 0m29.086s
> > > T=3: 0m46.841s
> > > T=4: 0m53.758s
> > > T=5: 1m10.856s
> > > T=6: 1m21.142s
> > > T=7: 1m38.362s
> > >
> > > So sending single IPIs adds about 1.5 microseconds per extra core. With
> > > the IPI-to-many scheme, we add about 0.2 microseconds per extra core. So
> > > we have a factor 10 gain in scalability. The initial cost of the cpumask
> > > allocation (which seems to be allocated on the stack in my config) is
> > > just about 1.4 microseconds. So here, we only have a small gain for the
> > > 1 IPI case, which does not justify the added complexity of dealing with
> > > it differently.
> >
> > I'd like to discuss to separate CONFIG_CPUMASK_OFFSTACK=1 and CONFIG_CPUMASK_OFFSTACK=0.
> >
> > CONFIG_CPUMASK_OFFSTACK=0 (your config)
> > - cpumask is allocated on stask
> > - alloc_cpumask_var() is nop (yes, nop is constant overhead ;)
> > - alloc_cpumask_var() always return 1, then membarrier_retry() is never called.
> > - alloc_cpumask_var() ignore GFP_KERNEL parameter
> >
> > CONFIG_CPUMASK_OFFSTACK=1 and use GFP_KERNEL
> > - cpumask is allocated on heap
> > - alloc_cpumask_var() is the wrapper of kmalloc()
> > - GFP_KERNEL parameter is passed kmalloc
> > - GFP_KERNEL mean alloc_cpumask_var() always return 1, except
> > oom-killer case. IOW, membarrier_retry() is still never called
> > on typical use case.
> > - kmalloc(GFP_KERNEL) might invoke page reclaim and it can spent few
> > seconds (not microseconds).
> >
> > CONFIG_CPUMASK_OFFSTACK=1 and use GFP_ATOMIC
> > - cpumask is allocated on heap
> > - alloc_cpumask_var() is the wrapper of kmalloc()
> > - GFP_ATOMIC mean kmalloc never invoke page reclaim. IOW,
> > kmalloc() cost is nearly constant. (few or lots microseconds)
> > - OTOH, alloc_cpumask_var() might fail, at that time membarrier_retry()
> > is called.
> >
> > So, My last mail talked about CONFIG_CPUMASK_OFFSTACK=1, but you mesured CONFIG_CPUMASK_OFFSTACK=0.
> > That's the reason why our conclusion is different.
>
> I would have to put my system in OOM condition anyway to measure the
> page reclaim overhead. Given that sys_membarrier is not exactly a fast
> path, I don't think it matters _that much_.
>
> Hrm. Well, given the "expedited" nature of the system call, it might
> come as a surprise to have to wait for page reclaim, and surprises are
> not good. OTOH, I don't want to allow users to easily consume all the
> GFP_ATOMIC pool. But I think it's unlikely, as we are bounded by the
> number of processors which can concurrently run sys_membarrier().

GFP_NOWAIT would prevent that kind of GFP_ATOMIC pool consumption. But yes,
you already gave the answer I wanted: "sys_membarrier is not exactly a fast
path, I don't think it matters _that much_."
Okay, I understand the liburcu latency policy. IOW, I agree with your patch.

Thanks a lot for the explanation.

2010-01-14 09:09:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Wed, 2010-01-13 at 14:36 -0500, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> > > + for_each_cpu(cpu, tmpmask) {
> > > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > > + mm = cpu_curr(cpu)->mm;
> > > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > + if (current->mm != mm)
> > > + cpumask_clear_cpu(cpu, tmpmask);
> > > + }
> >
> > Why not:
> >
> > rcu_read_lock();
> > if (current->mm != cpu_curr(cpu)->mm)
> > cpumask_clear_cpu(cpu, tmpmask);
> > rcu_read_unlock();
> >
> > the RCU read lock ensures the task_struct obtained remains valid, and it
> > avoids taking the rq->lock.
> >
>
> If we go for a simple rcu_read_lock, I think that we need a smp_mb()
> after switch_to() updates the current task on the remote CPU, before it
> returns to user-space. Do we have this guarantee for all architectures ?
>
> So what I'm looking for, overall, is:
>
> schedule()
> ...
> switch_mm()
> smp_mb()
> clear mm_cpumask
> set mm_cpumask
> switch_to()
> update current task
> smp_mb()
>
> If we have that, then the rcu_read_lock should work.
>
> What the rq lock currently gives us is the guarantee that if the current
> thread changes on a remote CPU while we are not holding this lock, then
> a full scheduler execution is performed, which implies a memory barrier
> if we change the current thread (it does, right ?).

I'm not quite seeing it, we have 4 possibilities, switches between
threads with:

a) our mm, another mm

- if we observe the former, we'll send an IPI (redundant)
- if we observe the latter, the switch_mm will have issued an mb

b) another mm, our mm

- if we observe the former, we're good because the cpu didn't run our
thread when we called sys_membarrier()
- if we observe the latter, we'll send an IPI (redundant)

c) our mm, our mm

- no matter which task we observe, we'll match and send an IPI

d) another mm, another mm

- no matter which task we observe, we'll not match and not send an
IPI.


Or am I missing something?

2010-01-14 16:26:15

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Peter Zijlstra ([email protected]) wrote:
> On Wed, 2010-01-13 at 14:36 -0500, Mathieu Desnoyers wrote:
> > * Peter Zijlstra ([email protected]) wrote:
> > > On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> > > > + for_each_cpu(cpu, tmpmask) {
> > > > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > > > + mm = cpu_curr(cpu)->mm;
> > > > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > > + if (current->mm != mm)
> > > > + cpumask_clear_cpu(cpu, tmpmask);
> > > > + }
> > >
> > > Why not:
> > >
> > > rcu_read_lock();
> > > if (current->mm != cpu_curr(cpu)->mm)
> > > cpumask_clear_cpu(cpu, tmpmask);
> > > rcu_read_unlock();
> > >
> > > the RCU read lock ensures the task_struct obtained remains valid, and it
> > > avoids taking the rq->lock.
> > >
> >
> > If we go for a simple rcu_read_lock, I think that we need a smp_mb()
> > after switch_to() updates the current task on the remote CPU, before it
> > returns to user-space. Do we have this guarantee for all architectures ?
> >
> > So what I'm looking for, overall, is:
> >
> > schedule()
> > ...
> > switch_mm()
> > smp_mb()
> > clear mm_cpumask
> > set mm_cpumask
> > switch_to()
> > update current task
> > smp_mb()
> >
> > If we have that, then the rcu_read_lock should work.
> >
> > What the rq lock currently gives us is the guarantee that if the current
> > thread changes on a remote CPU while we are not holding this lock, then
> > a full scheduler execution is performed, which implies a memory barrier
> > if we change the current thread (it does, right ?).
>
> I'm not quite seeing it, we have 4 possibilities, switches between
> threads with:
>
> a) our mm, another mm
>
> - if we observe the former, we'll send an IPI (redundant)
> - if we observe the latter, the switch_mm will have issued an mb
>
> b) another mm, our mm
>
> - if we observe the former, we're good because the cpu didn't run our
> thread when we called sys_membarrier()
> - if we observe the latter, we'll send an IPI (redundant)

It's this scenario that is causing problem. Let's consider this
execution:

CPU 0 (membarrier)                      CPU 1 (another mm -> our mm)
<kernel-space>                          <kernel-space>
                                        switch_mm()
                                          smp_mb()
                                          clear_mm_cpumask()
                                          set_mm_cpumask()
                                          smp_mb() (by load_cr3() on x86)
                                        switch_to()
mm_cpumask includes CPU 1
rcu_read_lock()
if (CPU 1 mm != our mm)
  skip CPU 1.
rcu_read_unlock()
                                        current = next (1)
                                        <switch back to user-space>
                                        read-lock()
                                        read gp, store local gp
                                        barrier()
                                        access critical section (2)

So if we don't have any memory barrier between (1) and (2), the memory
operations can be reordered in such a way that CPU 0 will not send an IPI
to a CPU that would need to have its barrier() promoted into an
smp_mb().

Replacing these kernel rcu_read_lock/unlock() by rq locks ensures that
when the scheduler runs concurrently on another CPU, _all_ the scheduling
code is executed atomically wrt the spin lock taken on cpu 0.

When x86 uses iret to return to user-space, then we have a serializing
instruction. But if it uses sysexit, or if we are on a different
architecture, are we sure that a memory barrier is issued before
returning to user-space ?

Thanks,

Mathieu

>
> c) our mm, our mm
>
> - no matter which task we observe, we'll match and send an IPI
>
> d) another mm, another mm
>
> - no matter which task we observe, we'll not match and not send an
> IPI.
>
>
> Or am I missing something?
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 17:04:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:

> It's this scenario that is causing problem. Let's consider this
> execution:
>
> CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> <kernel-space> <kernel-space>
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (CPU 1 mm != our mm)
> skip CPU 1.
> rcu_read_unlock()
> current = next (1)
> <switch back to user-space>
> read-lock()
> read gp, store local gp
> barrier()
> access critical section (2)
>
> So if we don't have any memory barrier between (1) and (2), the memory
> operations can be reordered in such a way that CPU 0 will not send IPI
> to a CPU that would need to have it's barrier() promoted into a
> smp_mb().

I'm still not getting it, sure we don't send an IPI, but it will have
done an mb() in switch_mm() to become our mm, so even without the IPI it
will have executed that mb we were after.


2010-01-14 17:54:53

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:
>
> > It's this scenario that is causing problem. Let's consider this
> > execution:
> >

(slightly augmented)

CPU 0 (membarrier)                      CPU 1 (another mm -> our mm)
<user-space>
                                        <kernel-space>
                                        switch_mm()
                                          smp_mb()
                                          clear_mm_cpumask()
                                          set_mm_cpumask()
                                          smp_mb() (by load_cr3() on x86)
                                        switch_to()
memory access before membarrier
<call sys_membarrier()>
smp_mb()
mm_cpumask includes CPU 1
rcu_read_lock()
if (CPU 1 mm != our mm)
  skip CPU 1.
rcu_read_unlock()
smp_mb()
<return to user-space>
                                        current = next (1)
                                        <switch back to user-space>
                                        urcu read lock()
                                        read gp
                                        store local gp (2)
                                        barrier()
                                        access critical section data (3)
memory access after membarrier

So if we don't have any memory barrier between (1) and (3), the memory
operations can be reordered in such a way that CPU 0 will not send an IPI
to a CPU that would need to have its barrier() promoted into an
smp_mb().

>
> I'm still not getting it, sure we don't send an IPI, but it will have
> done an mb() in switch_mm() to become our mm, so even without the IPI it
> will have executed that mb we were after.

The augmented race window above shows that it would be possible for (2)
and (3) to be reordered across the barrier(), and therefore the critical
section access could spill over into an rcu-unlocked state.

Thanks,

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 18:37:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Mathieu Desnoyers ([email protected]) wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:
> >
> > > It's this scenario that is causing problem. Let's consider this
> > > execution:
> > >
>
> (slightly augmented)
>
> CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> <user-space>
> <kernel-space>
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> memory access before membarrier
> <call sys_membarrier()>
> smp_mb()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (CPU 1 mm != our mm)
> skip CPU 1.
> rcu_read_unlock()
> smp_mb()
> <return to user-space>
> current = next (1)
> <switch back to user-space>
> urcu read lock()
> read gp
> store local gp (2)
> barrier()
> access critical section data (3)
> memory access after membarrier
>
> So if we don't have any memory barrier between (1) and (3), the memory
> operations can be reordered in such a way that CPU 0 will not send IPI
> to a CPU that would need to have it's barrier() promoted into a
> smp_mb().
>
> >
> > I'm still not getting it, sure we don't send an IPI, but it will have
> > done an mb() in switch_mm() to become our mm, so even without the IPI it
> > will have executed that mb we were after.
>
> The augmented race window above shows that it would be possible for (2)
> and (3) to be reordered across the barrier(), and therefore the critical
> section access could spill over a rcu-unlocked state.

To make this painfully clear, I'll reorder the accesses to match that of
the CPU to memory:

CPU 0 (membarrier)                      CPU 1 (another mm -> our mm)
<user-space>
                                        <kernel-space>
                                        switch_mm()
                                          smp_mb()
                                          clear_mm_cpumask()
                                          set_mm_cpumask()
                                          smp_mb() (by load_cr3() on x86)
                                        switch_to()
                                        <buffered current = next>
                                        <switch back to user-space>
                                        urcu read lock()
                                        access critical section data (3)
memory access before membarrier
<call sys_membarrier()>
smp_mb()
mm_cpumask includes CPU 1
rcu_read_lock()
if (CPU 1 mm != our mm)
  skip CPU 1.
rcu_read_unlock()
smp_mb()
<return to user-space>
memory access after membarrier
                                        current = next (1) (buffer flush)
                                        read gp
                                        store local gp (2)

This should make the problem a bit more evident. Access (3) is done
outside of the read-side C.S. as far as the userspace synchronize_rcu()
is concerned.

Thanks,

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 18:50:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-14 at 12:54 -0500, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:
> >
> > > It's this scenario that is causing problem. Let's consider this
> > > execution:
> > >
>
> (slightly augmented)
>
> CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> <user-space>
> <kernel-space>
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> memory access before membarrier
> <call sys_membarrier()>
> smp_mb()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (CPU 1 mm != our mm)

But here, CPU 1 updated its mm already and did a
smp_mb, won't that make us send the smp_mb
anyway?

-- Steve

> skip CPU 1.
> rcu_read_unlock()
> smp_mb()
> <return to user-space>
> current = next (1)
> <switch back to user-space>
> urcu read lock()
> read gp
> store local gp (2)
> barrier()
> access critical section data (3)
> memory access after membarrier

2010-01-14 18:52:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-14 at 13:37 -0500, Mathieu Desnoyers wrote:

> To make this painfully clear, I'll reorder the accesses to match that of
> the CPU to memory:
>
> CPU 0 (membarrier) CPU 1 (another mm -our mm)
> <user-space>
> <kernel-space>
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> <buffered current = next>
> <switch back to user-space>
> urcu read lock()
> access critical section data (3)
> memory access before membarrier
> <call sys_membarrier()>
> smp_mb()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (CPU 1 mm != our mm)
> skip CPU 1.

I still don't see how the above conditional fails?

-- Steve

> rcu_read_unlock()
> smp_mb()
> <return to user-space>
> memory access after membarrier
> current = next (1) (buffer flush)
> read gp
> store local gp (2)
>
> This should make the problem a bit more evident. Access (3) is done
> outside of the read-side C.S. as far as the userspace synchronize_rcu()
> is concerned.
>
> Thanks,
>
> Mathieu
>
>

2010-01-14 19:34:04

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Steven Rostedt ([email protected]) wrote:
> On Thu, 2010-01-14 at 13:37 -0500, Mathieu Desnoyers wrote:
>
> > To make this painfully clear, I'll reorder the accesses to match that of
> > the CPU to memory:
> >
> > CPU 0 (membarrier) CPU 1 (another mm -our mm)
> > <user-space>
> > <kernel-space>
> > switch_mm()
> > smp_mb()
> > clear_mm_cpumask()
> > set_mm_cpumask()
> > smp_mb() (by load_cr3() on x86)
> > switch_to()
> > <buffered current = next>
> > <switch back to user-space>
> > urcu read lock()
> > access critical section data (3)
> > memory access before membarrier
> > <call sys_membarrier()>
> > smp_mb()
> > mm_cpumask includes CPU 1
> > rcu_read_lock()
> > if (CPU 1 mm != our mm)
> > skip CPU 1.
>
> I still don't see how the above conditional fails?

First, I just want to fix one detail I had wrong. It does not change the
end result, but it changes the order of the scenario:

A CPU's "current" task struct is not the same thing as that same CPU's
rq->curr. So we are talking about the rq->curr update here, not the CPU's
"current" task (as I mistakenly assumed previously).

if (CPU 1 mm != our mm) translates into:

if (cpu_curr(1)->mm != current->mm)

where cpu_curr(cpu) is:

#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)

struct rq "curr" field is a struct task_struct *, updated by
schedule() before calling context_switch().

So the requirement is that we need a smp_mb() before and after rq->curr
update in schedule(). The smp_mb() after the update is ensured by
context_switch() -> switch_mm() -> load_cr3(). However, updating my
scenario to match the fact that we are really talking about rq->curr
update here (which happens _before_ switch_mm() and not after), we can
see that the problematic case happens if there is no smp_mb() before
rq->curr update:

It's a case where CPU 1 switches from our mm to another mm:

CPU 0 (membarrier)                      CPU 1 (our mm -> another mm)
<user-space>                            <user-space>
                                        <buffered access C.S. data>
                                        urcu read unlock()
                                        barrier()
                                        store local gp
                                        <kernel-space>
                                        rq->curr = next (1)
memory access before membarrier
<call sys_membarrier()>
smp_mb()
mm_cpumask includes CPU 1
rcu_read_lock()
if (cpu_curr(1)->mm != our mm)
  skip CPU 1 -> here, rq->curr new version is already visible
rcu_read_unlock()
smp_mb()
<return to user-space>
memory access after membarrier
  -> this is where we allow freeing
     the old structure although the
     buffered access C.S. data is
     still in flight.
                                        User-space access C.S. data (2)
                                        (buffer flush)
                                        switch_mm()
                                          smp_mb()
                                          clear_mm_cpumask()
                                          set_mm_cpumask()
                                          smp_mb() (by load_cr3() on x86)
                                        switch_to()
                                        <buffered current = next>
                                        <switch back to user-space>
                                        current = next (1) (buffer flush)
                                        access critical section data (3)

As we can see, the reordering of (1) and (2) is problematic, as it lets
the check skip over a CPU that has global side-effects not yet committed
to memory.
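
In other words, the constraint on the scheduler side boils down to
something like this (a sketch of the requirement, not actual kernel code):

        /* in schedule(), around the runqueue current task update: */
        smp_mb();       /* order the old task's user-space accesses before
                         * the rq->curr update seen by sys_membarrier() */
        rq->curr = next;
        smp_mb();       /* order the rq->curr update before the new task's
                         * user-space accesses; when the mm changes, the
                         * load_cr3() in switch_mm() provides this on x86 */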

Hopefully this explanation helps ?

Thanks,

Mathieu

>
> -- Steve
>
> > rcu_read_unlock()
> > smp_mb()
> > <return to user-space>
> > memory access after membarrier
> > current = next (1) (buffer flush)
> > read gp
> > store local gp (2)
> >
> > This should make the problem a bit more evident. Access (3) is done
> > outside of the read-side C.S. as far as the userspace synchronize_rcu()
> > is concerned.
> >
> > Thanks,
> >
> > Mathieu
> >
> >
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 21:26:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-14 at 14:33 -0500, Mathieu Desnoyers wrote:

> It's a case where CPU 1 switches from our mm to another mm:
>
> CPU 0 (membarrier) CPU 1 (another mm -our mm)
> <user-space> <user-space>
> <buffered access C.S. data>
> urcu read unlock()
> barrier()
> store local gp
> <kernel-space>
> rq->curr = next (1)
> memory access before membarrier
> <call sys_membarrier()>
> smp_mb()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (cpu_curr(1)->mm != our mm)
> skip CPU 1 -> here, rq->curr new version is already visible
> rcu_read_unlock()
> smp_mb()
> <return to user-space>
> memory access after membarrier
> -> this is where we allow freeing
> the old structure although the
> buffered access C.S. data is
> still in flight.

OK, yeah this makes much more sense.

-- Steve

> User-space access C.S. data (2)
> (buffer flush)
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> <buffered current = next>
> <switch back to user-space>
> current = next (1) (buffer flush)
> access critical section data (3)
>
> As we can see, the reordering of (1) and (2) is problematic, as it lets
> the check skip over a CPU that have global side-effects not committed to
> memory yet.

2010-01-19 16:49:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Wed, 2010-01-13 at 14:36 -0500, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra ([email protected]) wrote:
> > > > On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> > > > > + for_each_cpu(cpu, tmpmask) {
> > > > > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > > > > + mm = cpu_curr(cpu)->mm;
> > > > > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > > > + if (current->mm != mm)
> > > > > + cpumask_clear_cpu(cpu, tmpmask);
> > > > > + }
> > > >
> > > > Why not:
> > > >
> > > > rcu_read_lock();
> > > > if (current->mm != cpu_curr(cpu)->mm)
> > > > cpumask_clear_cpu(cpu, tmpmask);
> > > > rcu_read_unlock();
> > > >
> > > > the RCU read lock ensures the task_struct obtained remains valid, and it
> > > > avoids taking the rq->lock.
> > > >
> > >
> > > If we go for a simple rcu_read_lock, I think that we need a smp_mb()
> > > after switch_to() updates the current task on the remote CPU, before it
> > > returns to user-space. Do we have this guarantee for all architectures ?
> > >
> > > So what I'm looking for, overall, is:
> > >
> > > schedule()
> > > ...
> > > switch_mm()
> > > smp_mb()
> > > clear mm_cpumask
> > > set mm_cpumask
> > > switch_to()
> > > update current task
> > > smp_mb()
> > >
> > > If we have that, then the rcu_read_lock should work.
> > >
> > > What the rq lock currently gives us is the guarantee that if the current
> > > thread changes on a remote CPU while we are not holding this lock, then
> > > a full scheduler execution is performed, which implies a memory barrier
> > > if we change the current thread (it does, right ?).
> >
> > I'm not quite seeing it, we have 4 possibilities, switches between
> > threads with:
> >
> > a) our mm, another mm
> >
> > - if we observe the former, we'll send an IPI (redundant)
> > - if we observe the latter, the switch_mm will have issued an mb
> >
> > b) another mm, our mm
> >
> > - if we observe the former, we're good because the cpu didn't run our
> > thread when we called sys_membarrier()
> > - if we observe the latter, we'll send an IPI (redundant)
>
> It's this scenario that is causing problem. Let's consider this
> execution:
>
> CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> <kernel-space> <kernel-space>
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (CPU 1 mm != our mm)
> skip CPU 1.
> rcu_read_unlock()
> current = next (1)

OK, so on x86 current uses esp and will be flipped somewhere in the
switch_to() magic, cpu_curr(cpu) as used by CPU 0 uses rq->curr, which
will be set before context_switch() and that always implies a mb() for
non matching ->mm's [*]

> <switch back to user-space>
> read-lock()
> read gp, store local gp
> barrier()
> access critical section (2)
>
> So if we don't have any memory barrier between (1) and (2), the memory
> operations can be reordered in such a way that CPU 0 will not send IPI
> to a CPU that would need to have it's barrier() promoted into a
> smp_mb().

OK, so I'm utterly failing to make sense of the above, do you need more
than the 2 cpus discussed to make it go boom?

> Replacing these kernel rcu_read_lock/unlock() by rq locks ensures that
> when the scheduler runs concurrently on another CPU, _all_ the scheduling
> code is executed atomically wrt the spin lock taken on cpu 0.

Sure, but taking the rq->lock is fairly heavy handed.

> When x86 uses iret to return to user-space, then we have a serializing
> instruction. But if it uses sysexit, or if we are on a different
> architecture, are we sure that a memory barrier is issued before
> returning to user-space ?

[*] and possibly also for matching ->mm's, because:

OK, so I had a quick look at the switch_to() magic, and from what I can
make of it it implies an mb, if only because poking at the segment
registers implies LOCK semantics.

2010-01-19 17:11:33

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Peter Zijlstra ([email protected]) wrote:
[...]
> > It's this scenario that is causing problem. Let's consider this
> > execution:
> >
> > CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> > <kernel-space> <kernel-space>
> > switch_mm()
> > smp_mb()
> > clear_mm_cpumask()
> > set_mm_cpumask()
> > smp_mb() (by load_cr3() on x86)
> > switch_to()
> > mm_cpumask includes CPU 1
> > rcu_read_lock()
> > if (CPU 1 mm != our mm)
> > skip CPU 1.
> > rcu_read_unlock()
> > current = next (1)
>
> OK, so on x86 current uses esp and will be flipped somewhere in the
> switch_to() magic, cpu_curr(cpu) as used by CPU 0 uses rq->curr, which
> will be set before context_switch() and that always implies a mb() for
> non matching ->mm's [*]

Hi Peter,

Please refer to the discussion with Steven further down this thread
(http://lkml.org/lkml/2010/1/14/319), where I updated the scenario
after figuring out that "current" and rq->curr are indeed two different
things. It's rq->curr we are interested in here, not "current" as I
previously thought. (sorry about the mixup)

>
> > <switch back to user-space>
> > read-lock()
> > read gp, store local gp
> > barrier()
> > access critical section (2)
> >
> > So if we don't have any memory barrier between (1) and (2), the memory
> > operations can be reordered in such a way that CPU 0 will not send IPI
> > to a CPU that would need to have it's barrier() promoted into a
> > smp_mb().
>
> OK, so I'm utterly failing to make sense of the above, do you need more
> than the 2 cpus discussed to make it go boom?
>
> > Replacing these kernel rcu_read_lock/unlock() by rq locks ensures that
> > when the scheduler runs concurrently on another CPU, _all_ the scheduling
> > code is executed atomically wrt the spin lock taken on cpu 0.
>
> Sure, but taking the rq->lock is fairly heavy handed.
>
> > When x86 uses iret to return to user-space, then we have a serializing
> > instruction. But if it uses sysexit, or if we are on a different
> > architecture, are we sure that a memory barrier is issued before
> > returning to user-space ?
>
> [*] and possibly also for matching ->mm's, because:
>
> OK, so I had a quick look at the switch_to() magic, and from what I can
> make of it it implies an mb, if only because poking at the segment
> registers implies LOCK semantics.

Can you have a look at the updated scenario and reply with questions
that might arise ?

Thanks!

Mathieu

>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-19 17:30:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-19 at 17:47 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:

> > It's this scenario that is causing problem. Let's consider this
> > execution:
> >
> > CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> > <kernel-space> <kernel-space>
> > switch_mm()
> > smp_mb()
> > clear_mm_cpumask()
> > set_mm_cpumask()
> > smp_mb() (by load_cr3() on x86)
> > switch_to()
> > mm_cpumask includes CPU 1
> > rcu_read_lock()
> > if (CPU 1 mm != our mm)
> > skip CPU 1.
> > rcu_read_unlock()
> > current = next (1)
>
> OK, so on x86 current uses esp and will be flipped somewhere in the
> switch_to() magic, cpu_curr(cpu) as used by CPU 0 uses rq->curr, which
> will be set before context_switch() and that always implies a mb() for
> non matching ->mm's [*]


This explanation by Mathieu still does not show the issue. He finally
explained it correctly here:

http://lkml.org/lkml/2010/1/14/319

The issue is before switch_to. If the read side reads in a pointer but
gets preempted and goes to the kernel, and the kernel updates next
(before the switch_to), and the write side does sys_membarrier(), the:

if (cpu_curr(1)->mm != our mm)

will fail, so no memory barrier will happen yet.

Then the userspace side of the writer will free or change the data, but
the reader side read is still stuck in the bus (still before the
switch_to). Then the reader can get a bad pointer.

Here's a simple explanation:

CPU 0                                   CPU 1
----------                              -----------
<userspace>                             <userspace>
obj = get_obj();
                                        rcu_read_lock();
                                        obj = get_obj(); <load of obj>
                                        <obj still being retrieved from memory>
                                        <interrupt, schedule out>
                                        <kernelspace>
                                        rq->curr = next;
                                        <bus stall>
                                        <obj still in memory>

sys_membarrier();
<kernel space>
if (cpu_curr(1)->mm != mm) <-- false
<userspace>

modify object

                                        switch_to()
                                        <obj finally is loaded>
                                        <schedule kernel thread>
                                        <schedule task back>
                                        <userspace>
                                        refer(obj) <-- corruption!




>
> > <switch back to user-space>
> > read-lock()
> > read gp, store local gp
> > barrier()
> > access critical section (2)
> >
> > So if we don't have any memory barrier between (1) and (2), the memory
> > operations can be reordered in such a way that CPU 0 will not send IPI
> > to a CPU that would need to have it's barrier() promoted into a
> > smp_mb().
>
> OK, so I'm utterly failing to make sense of the above, do you need more
> than the 2 cpus discussed to make it go boom?

Just 2 cpu's as explained above.

>
> > Replacing these kernel rcu_read_lock/unlock() by rq locks ensures that
> > when the scheduler runs concurrently on another CPU, _all_ the scheduling
> > code is executed atomically wrt the spin lock taken on cpu 0.
>
> Sure, but taking the rq->lock is fairly heavy handed.

Yes, but for now it seems to be the only safe way.

>
> > When x86 uses iret to return to user-space, then we have a serializing
> > instruction. But if it uses sysexit, or if we are on a different
> > architecture, are we sure that a memory barrier is issued before
> > returning to user-space ?
>
> [*] and possibly also for matching ->mm's, because:
>
> OK, so I had a quick look at the switch_to() magic, and from what I can
> make of it it implies an mb, if only because poking at the segment
> registers implies LOCK semantics.

The problem is that this issue can be caused before we get to
switch_to().

-- Steve

2010-01-19 18:39:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-14 at 14:33 -0500, Mathieu Desnoyers wrote:
> It's a case where CPU 1 switches from our mm to another mm:
>
> CPU 0 (membarrier) CPU 1 (another mm -our mm)
> <user-space> <user-space>
> <buffered access C.S. data>
> urcu read unlock()
> barrier()
> store local gp
> <kernel-space>

OK, so the question is how we end up here. If it's through interrupt
preemption, I think the interrupt delivery will imply an mb; if it's a
blocking syscall, the set_task_state() mb [*] should be there.

Then we also do:

        clear_tsk_need_resched()

which is an atomic bitop (although it does not imply a full barrier
per se).

> rq->curr = next (1)
> memory access before membarrier
> <call sys_membarrier()>
> smp_mb()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (cpu_curr(1)->mm != our mm)
> skip CPU 1 -> here, rq->curr new version is already visible
> rcu_read_unlock()
> smp_mb()
> <return to user-space>
> memory access after membarrier
> -> this is where we allow freeing
> the old structure although the
> buffered access C.S. data is
> still in flight.
> User-space access C.S. data (2)
> (buffer flush)
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> <buffered current = next>
> <switch back to user-space>
> current = next (1) (buffer flush)
> access critical section data (3)
>
> As we can see, the reordering of (1) and (2) is problematic, as it lets
> the check skip over a CPU that have global side-effects not committed to
> memory yet.

Right, this one I get, thanks!


So about that [*], Oleg, kernel/signal.c:SYSCALL_DEFINE0(pause) does:

SYSCALL_DEFINE0(pause)
{
        current->state = TASK_INTERRUPTIBLE;
        schedule();
        return -ERESTARTNOHAND;
}

Isn't that ->state assignment buggy? If so, there appear to be quite a
few such sites, which worries me.

2010-01-19 19:07:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-19 at 19:37 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 14:33 -0500, Mathieu Desnoyers wrote:
> > It's a case where CPU 1 switches from our mm to another mm:
> >
> > CPU 0 (membarrier) CPU 1 (another mm -our mm)
> > <user-space> <user-space>
> > <buffered access C.S. data>
> > urcu read unlock()
> > barrier()
> > store local gp
> > <kernel-space>
>
> OK, so the question is how we end up here, if its though interrupt
> preemption I think the interrupt delivery will imply an mb,

I keep thinking that, but I think we actually refuted that in an earlier
discussion on this patch.

> if its a
> blocking syscall, the set_task_state() mb [*] should be there.
>
> Then we also do:
>
> clear_tsk_need_resched()
>
> which is an atomic bitop (although does not imply a full barrier
> per-se).
>
> > rq->curr = next (1)

We could possibly look at placing that assignment in context_switch()
between switch_mm() and switch_to(), which should provide a mb before
and after I think, Ingo?

2010-01-19 19:43:51

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-19 at 19:37 +0100, Peter Zijlstra wrote:

> So about that [*], Oleg, kernel/signal.c:SYSCALL_DEFINE0(pause) does:
>
> SYSCALL_DEFINE0(pause)
> {
> current->state = TASK_INTERRUPTIBLE;
> schedule();
> return -ERESTARTNOHAND;
> }
>
> Isn't that ->state assignment buggy? If so, there appear to be quite a
> few such sites, which worries me.
>

That looks buggy to me. Isn't this the reason we have
set_current_state()?

Although, since it is not checking any condition, it may not be buggy.
The check inside scheduler for state != TASK_RUNNING is protected inside
the rq locks, and any other task must grab the rq lock of the task
before it can change the task's state. schedule() also checks for
signals which would force schedule() to wake it up.

But that said, I still think that should be changed to
set_current_state(TASK_INTERRUPTIBLE).
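
For reference, the difference is roughly this (paraphrasing
include/linux/sched.h of this era, not quoting it verbatim):

/* Plain assignment: no ordering at all. */
#define __set_current_state(state_value)			\
        do { current->state = (state_value); } while (0)

/* set_current_state(): the store to ->state is followed by a full memory
 * barrier, so it cannot be reordered past a later condition test. */
#define set_current_state(state_value)				\
        set_mb(current->state, (state_value))

/* With that, the pause() above would read: */
SYSCALL_DEFINE0(pause)
{
        set_current_state(TASK_INTERRUPTIBLE);
        schedule();
        return -ERESTARTNOHAND;
}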


-- Steve


2010-01-20 03:13:27

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Peter Zijlstra ([email protected]) wrote:
> On Tue, 2010-01-19 at 19:37 +0100, Peter Zijlstra wrote:
> > On Thu, 2010-01-14 at 14:33 -0500, Mathieu Desnoyers wrote:
> > > It's a case where CPU 1 switches from our mm to another mm:
> > >
> > > CPU 0 (membarrier) CPU 1 (another mm -our mm)
> > > <user-space> <user-space>
> > > <buffered access C.S. data>
> > > urcu read unlock()
> > > barrier()
> > > store local gp
> > > <kernel-space>
> >
> > OK, so the question is how we end up here, if its though interrupt
> > preemption I think the interrupt delivery will imply an mb,
>
> I keep thinking that, but I think we actually refuted that in an earlier
> discussion on this patch.

Intel Architecture Software Developer's Manual Vol. 3: System
Programming
7.4 Serializing Instructions

"MOV to control reg, MOV to debug reg, WRMSR, INVD, INVLPG, WBINDV, LGDT,
LLDT, LIDT, LTR, CPUID, IRET, RSM"

So, this list does _not_ include: INT, SYSENTER, SYSEXIT.

Only IRET is included. So I don't think it is safe to assume that x86
has serializing instructions when entering/leaving the kernel.

>
> > if its a
> > blocking syscall, the set_task_state() mb [*] should be there.
> >
> > Then we also do:
> >
> > clear_tsk_need_resched()
> >
> > which is an atomic bitop (although does not imply a full barrier
> > per-se).
> >
> > > rq->curr = next (1)
>
> We could possibly look at placing that assignment in context_switch()
> between switch_mm() and switch_to(), which should provide a mb before
> and after I think, Ingo?

That's an interesting idea. It would indeed fix the problem of the
missing barrier before the assignment, but would lack the appropriate
barrier after the assignment. If the rq->curr = next; assignment is made
after load_cr3, then we lack a memory barrier between the assignment and
execution of following user-space code after returning with SYSEXIT (and
we lack the appropriate barrier for other architectures too).
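
Schematically, the requirement sys_membarrier() puts on that assignment
looks like the following (a sketch of the ordering constraint only, not
actual scheduler code; the open question in this thread is where such
barriers already exist, or where they can be added cheaply):

	smp_mb();	/* order prev's user-space accesses before the update */
	rq->curr = next;
	smp_mb();	/* order the update before next's user-space accesses */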

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-20 08:46:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-19 at 22:13 -0500, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Tue, 2010-01-19 at 19:37 +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-01-14 at 14:33 -0500, Mathieu Desnoyers wrote:
> > > > It's a case where CPU 1 switches from our mm to another mm:
> > > >
> > > > CPU 0 (membarrier) CPU 1 (another mm -our mm)
> > > > <user-space> <user-space>
> > > > <buffered access C.S. data>
> > > > urcu read unlock()
> > > > barrier()
> > > > store local gp
> > > > <kernel-space>
> > >
> > > OK, so the question is how we end up here; if it's through interrupt
> > > preemption, I think the interrupt delivery will imply an mb,
> >
> > I keep thinking that, but I think we actually refuted that in an earlier
> > discussion on this patch.
>
> Intel Architecture Software Developer's Manual Vol. 3: System
> Programming
> 7.4 Serializing Instructions
>
> "MOV to control reg, MOV to debug reg, WRMSR, INVD, INVLPG, WBINDV, LGDT,
> LLDT, LIDT, LTR, CPUID, IRET, RSM"
>
> So, this list does _not_ include: INT, SYSENTER, SYSEXIT.
>
> Only IRET is included. So I don't think it is safe to assume that x86
> has serializing instructions when entering/leaving the kernel.

I got confused by 7.1.2.1, automatic locking on interrupt acknowledge.

But I already retracted that statement.

> >
> > > if it's a
> > > blocking syscall, the set_task_state() mb [*] should be there.
> > >
> > > Then we also do:
> > >
> > > clear_tsk_need_resched()
> > >
> > > which is an atomic bitop (although does not imply a full barrier
> > > per-se).
> > >
> > > > rq->curr = next (1)
> >
> > We could possibly look at placing that assignment in context_switch()
> > between switch_mm() and switch_to(), which should provide a mb before
> > and after I think, Ingo?
>
> That's an interesting idea. It would indeed fix the problem of the
> missing barrier before the assignment, but would lack the appropriate
> barrier after the assignment. If the rq->curr = next; assignment is made
> after load_cr3, then we lack a memory barrier between the assignment and
> execution of following user-space code after returning with SYSEXIT (and
> we lack the appropriate barrier for other architectures too).

Well, 7.1.2.1 says that writing a segment register implies a LOCK, but
on second reading there are a number of qualifiers there, and I'm not
sure we satisfy them.

Peter, does our switch_to() imply a mb?

2010-01-21 11:28:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Tue, 2010-01-19 at 20:06 +0100, Peter Zijlstra wrote:
>
> We could possibly look at placing that assignment in context_switch()
> between switch_mm() and switch_to(), which should provide a mb before
> and after I think, Ingo?

Right, I just found out why we cannot do that: the first thing
context_switch() does is prepare_task_switch(), which includes
prepare_lock_switch(), which on __ARCH_WANT_UNLOCKED_CTXSW machines drops
the rq->lock, and we have to have rq->curr assigned by then.


2010-01-21 16:12:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-21 at 11:07 -0500, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:

> We can even create a generic fallback with the following kind of code in
> the meantime:
>
> static inline void spin_lock_mb(spinlock_t *lock)
> {
> spin_lock(&lock);

That would be spin_lock(lock);

> smp_mb();
> }
>
> static inline void spin_unlock_mb(spinlock_t *lock)
> {
> smp_mb();
> spin_unlock(&lock);

and spin_unlock(lock);

;-)

> }
>
> How does that sound ?

You may also need spin_lock_irqsave, et al. variants too.

-- Steve

2010-01-21 16:12:37

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Peter Zijlstra ([email protected]) wrote:
> On Tue, 2010-01-19 at 20:06 +0100, Peter Zijlstra wrote:
> >
> > We could possibly look at placing that assignment in context_switch()
> > between switch_mm() and switch_to(), which should provide a mb before
> > and after I think, Ingo?
>
> Right, I just found out why we cannot do that: the first thing
> context_switch() does is prepare_task_switch(), which includes
> prepare_lock_switch(), which on __ARCH_WANT_UNLOCKED_CTXSW machines drops
> the rq->lock, and we have to have rq->curr assigned by then.
>

OK.

One efficient way to fit the requirement of sys_membarrier() would be to
create spin_lock_mb()/spin_unlock_mb(), which would have full memory
barriers rather than the acquire/release semantic. These could be used
within schedule() execution. On UP, they would turn into preempt off/on
and a compiler barrier, just like normal spin locks.

On architectures like x86, the atomic instructions already imply a full
memory barrier, so we have a direct mapping and no overhead. On
architectures where the spin lock only provides acquire semantics (e.g.
powerpc using lwsync and isync), we would have to create an
alternate implementation with "sync".

We can even create a generic fallback with the following kind of code in
the meantime:

static inline void spin_lock_mb(spinlock_t *lock)
{
spin_lock(&lock);
smp_mb();
}

static inline void spin_unlock_mb(spinlock_t *lock)
{
smp_mb();
spin_unlock(&lock);
}

How does that sound ?
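
With the spin_lock(lock)/spin_unlock(lock) fix Steven points out above,
the fallback would read as follows (a sketch only; spin_lock_mb() and
spin_unlock_mb() are names proposed in this thread, not an existing
kernel API):

#include <linux/spinlock.h>

static inline void spin_lock_mb(spinlock_t *lock)
{
	spin_lock(lock);
	smp_mb();	/* upgrade the acquire to a full barrier */
}

static inline void spin_unlock_mb(spinlock_t *lock)
{
	smp_mb();	/* upgrade the release to a full barrier */
	spin_unlock(lock);
}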

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-21 16:17:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-21 at 11:07 -0500, Mathieu Desnoyers wrote:
>
> One efficient way to fit the requirement of sys_membarrier() would be to
> create spin_lock_mb()/spin_unlock_mb(), which would have full memory
> barriers rather than the acquire/release semantic. These could be used
> within schedule() execution. On UP, they would turn into preempt off/on
> and a compiler barrier, just like normal spin locks.
>
> > On architectures like x86, the atomic instructions already imply a full
> > memory barrier, so we have a direct mapping and no overhead. On
> > architectures where the spin lock only provides acquire semantics (e.g.
> > powerpc using lwsync and isync), we would have to create an
> > alternate implementation with "sync".

There's also clear_tsk_need_resched() which is an atomic op.

The thing I'm worrying about is not making schedule() more expensive for
a relatively rare operation like sys_membarrier(), while at the same
time not letting a while (1) sys_membarrier(); loop ruin your system.

On x86 there is plenty that implies a full mb before rq->curr = next;
the thing to figure out is what is generally the cheapest place to force
one for other architectures.

Not sure where that leaves us, since I'm not too familiar with !x86.

2010-01-21 16:22:52

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Steven Rostedt ([email protected]) wrote:
> On Thu, 2010-01-21 at 11:07 -0500, Mathieu Desnoyers wrote:
> > * Peter Zijlstra ([email protected]) wrote:
>
> > We can even create a generic fallback with the following kind of code in
> > the meantime:
> >
> > static inline void spin_lock_mb(spinlock_t *lock)
> > {
> > spin_lock(&lock);
>
> That would be spin_lock(lock);
>
> > smp_mb();
> > }
> >
> > static inline void spin_unlock_mb(spinlock_t *lock)
> > {
> > smp_mb();
> > spin_unlock(&lock);
>
> and spin_unlock(lock);
>
> ;-)

Oh, right. I should think of integrating a compiler in my mail client ;)


>
> > }
> >
> > How does that sound ?
>
> You may also need spin_lock_irqsave, et al. variants too.

Yep, or we simply use local_irq_save/restore separately. That could
be a good idea given that only a few specialized sites are affected.

Thanks,

Mathieu

>
> -- Steve
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-21 16:32:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

On Thu, 2010-01-21 at 11:22 -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:

> > You may also need spin_lock_irqsave, et al. variants too.
>
> > Yep, or we simply use local_irq_save/restore separately. That could
> > be a good idea given that only a few specialized sites are affected.

If it gets used more often, then we need to consider RT. RT modifies
spin_lock_irqsave into a standard mutex that does not disable
interrupts. But if something does:

local_irq_save(flags);
spin_lock_mb(&lock);
[...]

Then it will break RT (if lock is to be converted).

-- Steve

2010-01-21 17:01:14

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-01-21 at 11:07 -0500, Mathieu Desnoyers wrote:
> >
> > One efficient way to fit the requirement of sys_membarrier() would be to
> > create spin_lock_mb()/spin_unlock_mb(), which would have full memory
> > barriers rather than the acquire/release semantic. These could be used
> > within schedule() execution. On UP, they would turn into preempt off/on
> > and a compiler barrier, just like normal spin locks.
> >
> > On architectures like x86, the atomic instructions already imply a full
> > memory barrier, so we have a direct mapping and no overhead. On
> > architectures where the spin lock only provides acquire semantics (e.g.
> > powerpc using lwsync and isync), we would have to create an
> > alternate implementation with "sync".
>
> There's also clear_tsk_need_resched() which is an atomic op.

But clear_bit() only acts as a full memory barrier on x86 due to the
lock-prefix side-effect.

Ideally, if we add some kind of synchronization, it would be good to
piggy-back on spin lock/unlock, because these already identify
synchronization points (acquire/release semantics), and the spinlock
already surrounds the scheduler execution. As we need memory barriers
before and after the data modification, this looks like a sane way to
proceed: if the data update is protected by the spinlock, then we are
sure we have the matching full memory barriers.
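
As a schematic illustration only (hypothetical placement, reusing the
spin_lock_mb()/spin_unlock_mb() names proposed earlier in this thread),
the idea is that the update sys_membarrier() cares about sits inside a
critical section whose boundaries are upgraded to full barriers:

	spin_lock_mb(&rq->lock);	/* full barrier before the update */
	rq->curr = next;
	/* ... rest of the context switch ... */
	spin_unlock_mb(&rq->lock);	/* full barrier after the update */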

>
> The thing I'm worrying about is not making schedule() more expensive for
> a relatively rare operation like sys_membarrier(), while at the same
> time not letting a while (1) sys_membarrier(); loop ruin your system.

Yep, I share your concern.

>
> On x86 there is plenty that implies a full mb before rq->curr = next;
> the thing to figure out is what is generally the cheapest place to force
> one for other architectures.

Yep.

>
> Not sure where that leaves us, since I'm not too familiar with !x86.
>

As I proposed above, I think what we have to look for is: where are weak
memory barriers already required? Then we can upgrade those memory
barriers to full memory barriers. The spinlock approach is one possible
solution.

The problem with piggy-backing on clear_flag/set_flag is that these
operations don't semantically imply memory barriers at all. So adding
an additional full mb() around these would be much more costly than
"upgrading" an already-existing barrier.

Thanks,

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-21 17:07:22

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

* Steven Rostedt ([email protected]) wrote:
> On Thu, 2010-01-21 at 11:22 -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
>
> > > You may also need spin_lock_irqsave, et al. variants too.
> >
> > Yep, or we simply use the local_irq_save/restore separately. That could
> > be a good idea given that only few specialized sites are affected.
>
> If it gets used more often, then we need to consider RT. RT modifies
> spin_lock_irqsave into a standard mutex that does not disable
> interrupts. But if something does:
>
> local_irq_save(flags);
> spin_lock_mb(&lock);
> [...]
>
> Then it will break RT (if lock is to be converted).

Good point !

Mathieu

>
> -- Steve
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68