Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751593AbdG0W5d (ORCPT ); Thu, 27 Jul 2017 18:57:33 -0400 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:59133 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751113AbdG0W5b (ORCPT ); Thu, 27 Jul 2017 18:57:31 -0400 Date: Thu, 27 Jul 2017 15:57:27 -0700 From: "Paul E. McKenney" To: Mathieu Desnoyers Cc: Peter Zijlstra , linux-kernel , Boqun Feng , Andrew Hunter , maged michael , gromer , Avi Kivity Subject: Re: [RFC PATCH v2] membarrier: expedited private command Reply-To: paulmck@linux.vnet.ibm.com References: <20170727211314.32666-1-mathieu.desnoyers@efficios.com> <20170727221357.GS3730@linux.vnet.ibm.com> <1653464831.29191.1501195285642.JavaMail.zimbra@efficios.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1653464831.29191.1501195285642.JavaMail.zimbra@efficios.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 17072722-0044-0000-0000-0000037459F3 X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00007437; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000214; SDB=6.00893836; UDB=6.00446899; IPR=6.00673997; BA=6.00005495; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00016414; XFM=3.00000015; UTC=2017-07-27 22:57:28 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 17072722-0045-0000-0000-000007A264E8 Message-Id: <20170727225727.GT3730@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-07-27_14:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=2 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1706020000 definitions=main-1707270350 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 13063 Lines: 339 On Thu, Jul 27, 2017 at 10:41:25PM +0000, Mathieu Desnoyers wrote: > ----- On Jul 27, 2017, at 6:13 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote: > > > On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote: > >> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built > >> from all runqueues for which current thread's mm is the same as the > >> thread calling sys_membarrier. > >> > >> Scheduler-wise, it requires that we add a memory barrier after context > >> switching between processes (which have different mm). Interestingly, > >> there is already a memory barrier in mmdrop(), so we only need to add > >> a barrier when switching from a kernel thread to a userspace thread. > >> We also don't need to add the barrier when switching to a kernel thread, > >> because it has no userspace memory mapping, which makes ordering of > >> user-space memory accesses pretty much useless. > >> > >> * Benchmark > >> > >> A stress-test benchmark of sched pipe shows that it does not add > >> significant overhead to the scheduler switching between processes: > >> > >> 100 runs of: > >> > >> taskset 01 ./perf bench sched pipe > >> > >> Running 'sched/pipe' benchmark: > >> Executed 1000000 pipe operations between two processes > >> > >> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz > >> > >> A) With 4.13.0-rc2+ > >> at commit a97fb594bc7d ("virtio-net: fix module unloading") > >> > >> avg.: 2.923 usecs/op > >> std.dev: 0.057 usecs/op > >> > >> B) With this commit: > >> > >> avg.: 2.916 usecs/op > >> std.dev: 0.043 usecs/op > >> > >> Changes since v1: > >> - move membarrier code under kernel/sched/ because it uses the > >> scheduler runqueue, > >> - only add the barrier when we switch from a kernel thread. The case > >> where we switch from a user-space thread is already handled by > >> the atomic_dec_and_test() in mmdrop(). > >> - add a comment to mmdrop() documenting the requirement on the implicit > >> memory barrier. > >> > >> CC: Peter Zijlstra > >> CC: Paul E. McKenney > >> CC: Boqun Feng > >> CC: Andrew Hunter > >> CC: Maged Michael > >> CC: gromer@google.com > >> CC: Avi Kivity > >> Signed-off-by: Mathieu Desnoyers > > > > Looks much better, thank you! > > > > I have queued this in place of my earlier patch for the moment. If there > > are no objections, I will push this into the upcoming v4.14 merge window. > > If someone else wants to push it into v4.14, I am of course fine with > > that, and am happy to give it a reviewed-by along the way. But if there > > are objections to your patch (suitably modified based on additional > > review and testing, of course) going into v4.14, I can always fall back > > to pushing my earlier simpler but less housebroken patch. ;-) > > I'm fine about you picking up this patch, even though it's tagged "RFC". > I'm sure concerns will have plenty of time to be voiced by others until > it reaches mainline anyway, at which point I'll address them and resubmit > new versions. Works for me! My -rcu tree is subject to rebasing, so I can easily replace the current patch with an updated one. Thanx, Paul > Thanks! > > Mathieu > > > > > Thanx, Paul > > > >> --- > >> MAINTAINERS | 2 +- > >> include/linux/sched/mm.h | 5 +++ > >> include/uapi/linux/membarrier.h | 23 +++++++++++-- > >> kernel/Makefile | 1 - > >> kernel/sched/Makefile | 1 + > >> kernel/sched/core.c | 27 ++++++++++++++++ > >> kernel/{ => sched}/membarrier.c | 72 ++++++++++++++++++++++++++++++++++++++++- > >> 7 files changed, 126 insertions(+), 5 deletions(-) > >> rename kernel/{ => sched}/membarrier.c (59%) > >> > >> diff --git a/MAINTAINERS b/MAINTAINERS > >> index f66488dfdbc9..3b035584272f 100644 > >> --- a/MAINTAINERS > >> +++ b/MAINTAINERS > >> @@ -8621,7 +8621,7 @@ M: Mathieu Desnoyers > >> M: "Paul E. McKenney" > >> L: linux-kernel@vger.kernel.org > >> S: Supported > >> -F: kernel/membarrier.c > >> +F: kernel/sched/membarrier.c > >> F: include/uapi/linux/membarrier.h > >> > >> MEMORY MANAGEMENT > >> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h > >> index 2b24a6974847..5c5384d9ae0f 100644 > >> --- a/include/linux/sched/mm.h > >> +++ b/include/linux/sched/mm.h > >> @@ -38,6 +38,11 @@ static inline void mmgrab(struct mm_struct *mm) > >> extern void __mmdrop(struct mm_struct *); > >> static inline void mmdrop(struct mm_struct *mm) > >> { > >> + /* > >> + * Implicit full memory barrier provided by > >> + * atomic_dec_and_test() is required by membarrier. See comments > >> + * around membarrier_expedited_mb_after_set_current(). > >> + */ > >> if (unlikely(atomic_dec_and_test(&mm->mm_count))) > >> __mmdrop(mm); > >> } > >> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h > >> index e0b108bd2624..6d47b3249d8a 100644 > >> --- a/include/uapi/linux/membarrier.h > >> +++ b/include/uapi/linux/membarrier.h > >> @@ -40,14 +40,33 @@ > >> * (non-running threads are de facto in such a > >> * state). This covers threads from all processes > >> * running on the system. This command returns 0. > >> + * @MEMBARRIER_CMD_PRIVATE_EXPEDITED: > >> + * Execute a memory barrier on each running > >> + * thread belonging to the same process as the current > >> + * thread. Upon return from system call, the > >> + * caller thread is ensured that all its running > >> + * threads siblings have passed through a state > >> + * where all memory accesses to user-space > >> + * addresses match program order between entry > >> + * to and return from the system call > >> + * (non-running threads are de facto in such a > >> + * state). This only covers threads from the > >> + * same processes as the caller thread. This > >> + * command returns 0. The "expedited" commands > >> + * complete faster than the non-expedited ones, > >> + * they never block, but have the downside of > >> + * causing extra overhead. > >> * > >> * Command to be passed to the membarrier system call. The commands need to > >> * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to > >> * the value 0. > >> */ > >> enum membarrier_cmd { > >> - MEMBARRIER_CMD_QUERY = 0, > >> - MEMBARRIER_CMD_SHARED = (1 << 0), > >> + MEMBARRIER_CMD_QUERY = 0, > >> + MEMBARRIER_CMD_SHARED = (1 << 0), > >> + /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */ > >> + /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */ > >> + MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3), > >> }; > >> > >> #endif /* _UAPI_LINUX_MEMBARRIER_H */ > >> diff --git a/kernel/Makefile b/kernel/Makefile > >> index 4cb8e8b23c6e..9c323a6daa46 100644 > >> --- a/kernel/Makefile > >> +++ b/kernel/Makefile > >> @@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o > >> obj-$(CONFIG_JUMP_LABEL) += jump_label.o > >> obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o > >> obj-$(CONFIG_TORTURE_TEST) += torture.o > >> -obj-$(CONFIG_MEMBARRIER) += membarrier.o > >> > >> obj-$(CONFIG_HAS_IOMEM) += memremap.o > >> > >> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile > >> index 53f0164ed362..78f54932ea1d 100644 > >> --- a/kernel/sched/Makefile > >> +++ b/kernel/sched/Makefile > >> @@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o > >> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o > >> obj-$(CONFIG_CPU_FREQ) += cpufreq.o > >> obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o > >> +obj-$(CONFIG_MEMBARRIER) += membarrier.o > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c > >> index 17c667b427b4..01e3b881ab3a 100644 > >> --- a/kernel/sched/core.c > >> +++ b/kernel/sched/core.c > >> @@ -2724,6 +2724,32 @@ asmlinkage __visible void schedule_tail(struct > >> task_struct *prev) > >> put_user(task_pid_vnr(current), current->set_child_tid); > >> } > >> > >> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm, > >> + struct mm_struct *oldmm) > >> +{ > >> + if (!IS_ENABLED(CONFIG_MEMBARRIER)) > >> + return; > >> + /* > >> + * __schedule()-> > >> + * finish_task_switch()-> > >> + * if (mm) > >> + * mmdrop(mm) -> > >> + * atomic_dec_and_test() > >> + * takes care of issuing a memory barrier when oldmm is > >> + * non-NULL. We also don't need the barrier when switching to a > >> + * kernel thread, nor when we switch between threads belonging > >> + * to the same process. > >> + */ > >> + if (likely(oldmm || !mm || mm == oldmm)) > >> + return; > >> + /* > >> + * When switching between processes, membarrier expedited > >> + * private requires a memory barrier after we set the current > >> + * task. > >> + */ > >> + smp_mb(); > >> +} > >> + > >> /* > >> * context_switch - switch to the new MM and the new thread's register state. > >> */ > >> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev, > >> > >> mm = next->mm; > >> oldmm = prev->active_mm; > >> + membarrier_expedited_mb_after_set_current(mm, oldmm); > >> /* > >> * For paravirt, this is coupled with an exit in switch_to to > >> * combine the page table reload and the switch backend into > >> diff --git a/kernel/membarrier.c b/kernel/sched/membarrier.c > >> similarity index 59% > >> rename from kernel/membarrier.c > >> rename to kernel/sched/membarrier.c > >> index 9f9284f37f8d..f80828b0b607 100644 > >> --- a/kernel/membarrier.c > >> +++ b/kernel/sched/membarrier.c > >> @@ -17,12 +17,79 @@ > >> #include > >> #include > >> #include > >> +#include > >> + > >> +#include "sched.h" /* for cpu_rq(). */ > >> > >> /* > >> * Bitmask made from a "or" of all commands within enum membarrier_cmd, > >> * except MEMBARRIER_CMD_QUERY. > >> */ > >> -#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED) > >> +#define MEMBARRIER_CMD_BITMASK \ > >> + (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED) > >> + > >> +static void ipi_mb(void *info) > >> +{ > >> + smp_mb(); /* IPIs should be serializing but paranoid. */ > >> +} > >> + > >> +static void membarrier_private_expedited(void) > >> +{ > >> + int cpu, this_cpu; > >> + bool fallback = false; > >> + cpumask_var_t tmpmask; > >> + > >> + if (num_online_cpus() == 1) > >> + return; > >> + > >> + /* > >> + * Matches memory barriers around rq->curr modification in > >> + * scheduler. > >> + */ > >> + smp_mb(); /* system call entry is not a mb. */ > >> + > >> + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) { > >> + /* Fallback for OOM. */ > >> + fallback = true; > >> + } > >> + > >> + /* > >> + * Skipping the current CPU is OK even through we can be > >> + * migrated at any point. The current CPU, at the point where we > >> + * read raw_smp_processor_id(), is ensured to be in program > >> + * order with respect to the caller thread. Therefore, we can > >> + * skip this CPU from the iteration. > >> + */ > >> + this_cpu = raw_smp_processor_id(); > >> + cpus_read_lock(); > >> + for_each_online_cpu(cpu) { > >> + struct task_struct *p; > >> + > >> + if (cpu == this_cpu) > >> + continue; > >> + rcu_read_lock(); > >> + p = task_rcu_dereference(&cpu_rq(cpu)->curr); > >> + if (p && p->mm == current->mm) { > >> + if (!fallback) > >> + __cpumask_set_cpu(cpu, tmpmask); > >> + else > >> + smp_call_function_single(cpu, ipi_mb, NULL, 1); > >> + } > >> + rcu_read_unlock(); > >> + } > >> + cpus_read_unlock(); > >> + if (!fallback) { > >> + smp_call_function_many(tmpmask, ipi_mb, NULL, 1); > >> + free_cpumask_var(tmpmask); > >> + } > >> + > >> + /* > >> + * Memory barrier on the caller thread _after_ we finished > >> + * waiting for the last IPI. Matches memory barriers around > >> + * rq->curr modification in scheduler. > >> + */ > >> + smp_mb(); /* exit from system call is not a mb */ > >> +} > >> > >> /** > >> * sys_membarrier - issue memory barriers on a set of threads > >> @@ -64,6 +131,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags) > >> if (num_online_cpus() > 1) > >> synchronize_sched(); > >> return 0; > >> + case MEMBARRIER_CMD_PRIVATE_EXPEDITED: > >> + membarrier_private_expedited(); > >> + return 0; > >> default: > >> return -EINVAL; > >> } > >> -- > >> 2.11.0 > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com >