Date: Fri, 28 Jul 2017 13:58:29 -0700
From: "Paul E. McKenney"
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Boqun Feng,
	Andrew Hunter, Maged Michael, gromer@google.com, Avi Kivity,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Subject: Re: [PATCH v4] membarrier: expedited private command
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20170728204040.568-1-mathieu.desnoyers@efficios.com>
Message-Id: <20170728205829.GO3730@linux.vnet.ibm.com>

On Fri, Jul 28, 2017 at 04:40:40PM -0400, Mathieu Desnoyers wrote:
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
> built from all runqueues whose current thread's mm is the same as that
> of the thread calling sys_membarrier(). It executes faster than the
> non-expedited variant (no blocking). It also works on NOHZ_FULL
> configurations.
>
> Scheduler-wise, it requires a memory barrier before and after context
> switching between processes (which have different mm). The memory
> barrier before context switch is already present. For the barrier after
> context switch:
>
> * Our TSO archs can do RELEASE without being a full barrier. Look at
>   x86 spin_unlock() being a regular STORE for example. But for those
>   archs, all atomics imply smp_mb(), and all of them have atomic ops in
>   switch_mm() for mm_cpumask(), and issue load_cr3(), which acts as a
>   full barrier.
>
> * Of all the weakly ordered machines, only ARM64 and PPC can do RELEASE;
>   the rest do a full smp_mb(), so there the spin_unlock() is a full
>   barrier and we're good.
>
> * ARM64 has a very heavy barrier in switch_to(), which suffices.
>
> * PPC just removed its barrier from switch_to(), but appears to be
>   talking about adding something to switch_mm(). So add a
>   smp_mb__after_unlock_lock() for now, until this is settled on the
>   PPC side.
>
> Changes since v3:
> - Properly document the memory barriers provided by each architecture.
> > Changes since v2: > - Address comments from Peter Zijlstra, > - Add smp_mb__after_unlock_lock() after finish_lock_switch() in > finish_task_switch() to add the memory barrier we need after storing > to rq->curr. This is much simpler than the previous approach relying > on atomic_dec_and_test() in mmdrop(), which actually added a memory > barrier in the common case of switching between userspace processes. > - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full > kernel, rather than having the whole membarrier system call returning > -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full. > Adapt the CMD_QUERY mask accordingly. > > Changes since v1: > - move membarrier code under kernel/sched/ because it uses the > scheduler runqueue, > - only add the barrier when we switch from a kernel thread. The case > where we switch from a user-space thread is already handled by > the atomic_dec_and_test() in mmdrop(). > - add a comment to mmdrop() documenting the requirement on the implicit > memory barrier. > > CC: Peter Zijlstra > CC: Paul E. McKenney > CC: Boqun Feng > CC: Andrew Hunter > CC: Maged Michael > CC: gromer@google.com > CC: Avi Kivity > CC: Benjamin Herrenschmidt > CC: Paul Mackerras > CC: Michael Ellerman > Signed-off-by: Mathieu Desnoyers And replaced v3 with this one, thank you! Thanx, Paul > --- > MAINTAINERS | 2 +- > arch/arm64/kernel/process.c | 2 + > arch/x86/mm/tlb.c | 3 +- > include/uapi/linux/membarrier.h | 23 +++++- > kernel/Makefile | 1 - > kernel/membarrier.c | 70 ------------------ > kernel/sched/Makefile | 1 + > kernel/sched/core.c | 25 +++++++ > kernel/sched/membarrier.c | 152 ++++++++++++++++++++++++++++++++++++++++ > 9 files changed, 204 insertions(+), 75 deletions(-) > delete mode 100644 kernel/membarrier.c > create mode 100644 kernel/sched/membarrier.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index f66488dfdbc9..3b035584272f 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -8621,7 +8621,7 @@ M: Mathieu Desnoyers > M: "Paul E. McKenney" > L: linux-kernel@vger.kernel.org > S: Supported > -F: kernel/membarrier.c > +F: kernel/sched/membarrier.c > F: include/uapi/linux/membarrier.h > > MEMORY MANAGEMENT > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c > index 659ae8094ed5..c8f7d98d8cb9 100644 > --- a/arch/arm64/kernel/process.c > +++ b/arch/arm64/kernel/process.c > @@ -360,6 +360,8 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev, > /* > * Complete any pending TLB or cache maintenance on this CPU in case > * the thread migrates to a different CPU. > + * This full barrier is also required by the membarrier system > + * call. > */ > dsb(ish); > > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 014d07a80053..bb103d693f33 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -132,7 +132,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, > * due to instruction fetches or for no reason at all, > * and neither LOCK nor MFENCE orders them. > * Fortunately, load_cr3() is serializing and gives the > - * ordering guarantee we need. > + * ordering guarantee we need. This full barrier is also > + * required by the membarrier system call. > */ > load_cr3(next->pgd); > > diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h > index e0b108bd2624..6d47b3249d8a 100644 > --- a/include/uapi/linux/membarrier.h > +++ b/include/uapi/linux/membarrier.h > @@ -40,14 +40,33 @@ > * (non-running threads are de facto in such a > * state). 
This covers threads from all processes > * running on the system. This command returns 0. > + * @MEMBARRIER_CMD_PRIVATE_EXPEDITED: > + * Execute a memory barrier on each running > + * thread belonging to the same process as the current > + * thread. Upon return from system call, the > + * caller thread is ensured that all its running > + * threads siblings have passed through a state > + * where all memory accesses to user-space > + * addresses match program order between entry > + * to and return from the system call > + * (non-running threads are de facto in such a > + * state). This only covers threads from the > + * same processes as the caller thread. This > + * command returns 0. The "expedited" commands > + * complete faster than the non-expedited ones, > + * they never block, but have the downside of > + * causing extra overhead. > * > * Command to be passed to the membarrier system call. The commands need to > * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to > * the value 0. > */ > enum membarrier_cmd { > - MEMBARRIER_CMD_QUERY = 0, > - MEMBARRIER_CMD_SHARED = (1 << 0), > + MEMBARRIER_CMD_QUERY = 0, > + MEMBARRIER_CMD_SHARED = (1 << 0), > + /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */ > + /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */ > + MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3), > }; > > #endif /* _UAPI_LINUX_MEMBARRIER_H */ > diff --git a/kernel/Makefile b/kernel/Makefile > index 4cb8e8b23c6e..9c323a6daa46 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o > obj-$(CONFIG_JUMP_LABEL) += jump_label.o > obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o > obj-$(CONFIG_TORTURE_TEST) += torture.o > -obj-$(CONFIG_MEMBARRIER) += membarrier.o > > obj-$(CONFIG_HAS_IOMEM) += memremap.o > > diff --git a/kernel/membarrier.c b/kernel/membarrier.c > deleted file mode 100644 > index 9f9284f37f8d..000000000000 > --- a/kernel/membarrier.c > +++ /dev/null > @@ -1,70 +0,0 @@ > -/* > - * Copyright (C) 2010, 2015 Mathieu Desnoyers > - * > - * membarrier system call > - * > - * This program is free software; you can redistribute it and/or modify > - * it under the terms of the GNU General Public License as published by > - * the Free Software Foundation; either version 2 of the License, or > - * (at your option) any later version. > - * > - * This program is distributed in the hope that it will be useful, > - * but WITHOUT ANY WARRANTY; without even the implied warranty of > - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > - * GNU General Public License for more details. > - */ > - > -#include > -#include > -#include > - > -/* > - * Bitmask made from a "or" of all commands within enum membarrier_cmd, > - * except MEMBARRIER_CMD_QUERY. > - */ > -#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED) > - > -/** > - * sys_membarrier - issue memory barriers on a set of threads > - * @cmd: Takes command values defined in enum membarrier_cmd. > - * @flags: Currently needs to be 0. For future extensions. > - * > - * If this system call is not implemented, -ENOSYS is returned. If the > - * command specified does not exist, or if the command argument is invalid, > - * this system call returns -EINVAL. For a given command, with flags argument > - * set to 0, this system call is guaranteed to always return the same value > - * until reboot. 
> - * > - * All memory accesses performed in program order from each targeted thread > - * is guaranteed to be ordered with respect to sys_membarrier(). If we use > - * the semantic "barrier()" to represent a compiler barrier forcing memory > - * accesses to be performed in program order across the barrier, and > - * smp_mb() to represent explicit memory barriers forcing full memory > - * ordering across the barrier, we have the following ordering table for > - * each pair of barrier(), sys_membarrier() and smp_mb(): > - * > - * The pair ordering is detailed as (O: ordered, X: not ordered): > - * > - * barrier() smp_mb() sys_membarrier() > - * barrier() X X O > - * smp_mb() X O O > - * sys_membarrier() O O O > - */ > -SYSCALL_DEFINE2(membarrier, int, cmd, int, flags) > -{ > - /* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */ > - if (tick_nohz_full_enabled()) > - return -ENOSYS; > - if (unlikely(flags)) > - return -EINVAL; > - switch (cmd) { > - case MEMBARRIER_CMD_QUERY: > - return MEMBARRIER_CMD_BITMASK; > - case MEMBARRIER_CMD_SHARED: > - if (num_online_cpus() > 1) > - synchronize_sched(); > - return 0; > - default: > - return -EINVAL; > - } > -} > diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile > index 53f0164ed362..78f54932ea1d 100644 > --- a/kernel/sched/Makefile > +++ b/kernel/sched/Makefile > @@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o > obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o > obj-$(CONFIG_CPU_FREQ) += cpufreq.o > obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o > +obj-$(CONFIG_MEMBARRIER) += membarrier.o > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 17c667b427b4..4f85494620d7 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2635,6 +2635,16 @@ static struct rq *finish_task_switch(struct task_struct *prev) > prev_state = prev->state; > vtime_task_switch(prev); > perf_event_task_sched_in(prev, current); > + /* > + * The membarrier system call requires a full memory barrier > + * after storing to rq->curr, before going back to user-space. > + * > + * TODO: This smp_mb__after_unlock_lock can go away if PPC end > + * up adding a full barrier to switch_mm(), or we should figure > + * out if a smp_mb__after_unlock_lock is really the proper API > + * to use. > + */ > + smp_mb__after_unlock_lock(); > finish_lock_switch(rq, prev); > finish_arch_post_lock_switch(); > > @@ -3324,6 +3334,21 @@ static void __sched notrace __schedule(bool preempt) > if (likely(prev != next)) { > rq->nr_switches++; > rq->curr = next; > + /* > + * The membarrier system call requires each architecture > + * to have a full memory barrier after updating > + * rq->curr, before returning to user-space. For TSO > + * (e.g. x86), the architecture must provide its own > + * barrier in switch_mm(). For weakly ordered machines > + * for which spin_unlock() acts as a full memory > + * barrier, finish_lock_switch() in common code takes > + * care of this barrier. For weakly ordered machines for > + * which spin_unlock() acts as a RELEASE barrier (only > + * arm64 and PowerPC), arm64 has a full barrier in > + * switch_to(), and PowerPC has > + * smp_mb__after_unlock_lock() before > + * finish_lock_switch(). 
> + */ > ++*switch_count; > > trace_sched_switch(preempt, prev, next); > diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c > new file mode 100644 > index 000000000000..a92fddc22747 > --- /dev/null > +++ b/kernel/sched/membarrier.c > @@ -0,0 +1,152 @@ > +/* > + * Copyright (C) 2010-2017 Mathieu Desnoyers > + * > + * membarrier system call > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + */ > + > +#include > +#include > +#include > +#include > + > +#include "sched.h" /* for cpu_rq(). */ > + > +/* > + * Bitmask made from a "or" of all commands within enum membarrier_cmd, > + * except MEMBARRIER_CMD_QUERY. > + */ > +#define MEMBARRIER_CMD_BITMASK \ > + (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED) > + > +static void ipi_mb(void *info) > +{ > + smp_mb(); /* IPIs should be serializing but paranoid. */ > +} > + > +static void membarrier_private_expedited(void) > +{ > + int cpu; > + bool fallback = false; > + cpumask_var_t tmpmask; > + > + if (num_online_cpus() == 1) > + return; > + > + /* > + * Matches memory barriers around rq->curr modification in > + * scheduler. > + */ > + smp_mb(); /* system call entry is not a mb. */ > + > + /* > + * Expedited membarrier commands guarantee that they won't > + * block, hence the GFP_NOWAIT allocation flag and fallback > + * implementation. > + */ > + if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { > + /* Fallback for OOM. */ > + fallback = true; > + } > + > + cpus_read_lock(); > + for_each_online_cpu(cpu) { > + struct task_struct *p; > + > + /* > + * Skipping the current CPU is OK even through we can be > + * migrated at any point. The current CPU, at the point > + * where we read raw_smp_processor_id(), is ensured to > + * be in program order with respect to the caller > + * thread. Therefore, we can skip this CPU from the > + * iteration. > + */ > + if (cpu == raw_smp_processor_id()) > + continue; > + rcu_read_lock(); > + p = task_rcu_dereference(&cpu_rq(cpu)->curr); > + if (p && p->mm == current->mm) { > + if (!fallback) > + __cpumask_set_cpu(cpu, tmpmask); > + else > + smp_call_function_single(cpu, ipi_mb, NULL, 1); > + } > + rcu_read_unlock(); > + } > + if (!fallback) { > + smp_call_function_many(tmpmask, ipi_mb, NULL, 1); > + free_cpumask_var(tmpmask); > + } > + cpus_read_unlock(); > + > + /* > + * Memory barrier on the caller thread _after_ we finished > + * waiting for the last IPI. Matches memory barriers around > + * rq->curr modification in scheduler. > + */ > + smp_mb(); /* exit from system call is not a mb */ > +} > + > +/** > + * sys_membarrier - issue memory barriers on a set of threads > + * @cmd: Takes command values defined in enum membarrier_cmd. > + * @flags: Currently needs to be 0. For future extensions. > + * > + * If this system call is not implemented, -ENOSYS is returned. If the > + * command specified does not exist, not available on the running > + * kernel, or if the command argument is invalid, this system call > + * returns -EINVAL. 
For a given command, with flags argument set to 0, > + * this system call is guaranteed to always return the same value until > + * reboot. > + * > + * All memory accesses performed in program order from each targeted thread > + * is guaranteed to be ordered with respect to sys_membarrier(). If we use > + * the semantic "barrier()" to represent a compiler barrier forcing memory > + * accesses to be performed in program order across the barrier, and > + * smp_mb() to represent explicit memory barriers forcing full memory > + * ordering across the barrier, we have the following ordering table for > + * each pair of barrier(), sys_membarrier() and smp_mb(): > + * > + * The pair ordering is detailed as (O: ordered, X: not ordered): > + * > + * barrier() smp_mb() sys_membarrier() > + * barrier() X X O > + * smp_mb() X O O > + * sys_membarrier() O O O > + */ > +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags) > +{ > + if (unlikely(flags)) > + return -EINVAL; > + switch (cmd) { > + case MEMBARRIER_CMD_QUERY: > + { > + int cmd_mask = MEMBARRIER_CMD_BITMASK; > + > + if (tick_nohz_full_enabled()) > + cmd_mask &= ~MEMBARRIER_CMD_SHARED; > + return cmd_mask; > + } > + case MEMBARRIER_CMD_SHARED: > + /* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */ > + if (tick_nohz_full_enabled()) > + return -EINVAL; > + if (num_online_cpus() > 1) > + synchronize_sched(); > + return 0; > + case MEMBARRIER_CMD_PRIVATE_EXPEDITED: > + membarrier_private_expedited(); > + return 0; > + default: > + return -EINVAL; > + } > +} > -- > 2.11.0 >
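
[Editorial note, not part of the patch: the sketch below illustrates how a userspace caller might exercise the new command. It assumes kernel headers from a tree with this patch applied, so that <linux/membarrier.h> defines MEMBARRIER_CMD_PRIVATE_EXPEDITED, and it invokes the system call through syscall(2) with __NR_membarrier, since glibc provides no wrapper.]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int mask;

	/* MEMBARRIER_CMD_QUERY returns the bitmask of supported commands. */
	mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
	if (mask < 0) {
		perror("membarrier");	/* ENOSYS: syscall not implemented */
		exit(EXIT_FAILURE);
	}
	if (!(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
		fprintf(stderr, "MEMBARRIER_CMD_PRIVATE_EXPEDITED unsupported\n");
		exit(EXIT_FAILURE);
	}

	/*
	 * IPI-based memory barrier covering all running threads of the
	 * calling process; never blocks, returns 0 on success.
	 */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0)) {
		perror("membarrier");
		exit(EXIT_FAILURE);
	}
	return 0;
}

Note that with this patch, MEMBARRIER_CMD_QUERY still reports MEMBARRIER_CMD_PRIVATE_EXPEDITED on nohz_full kernels; only MEMBARRIER_CMD_SHARED is masked out of the returned bitmask there.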