Date: Thu, 27 Jul 2017 15:13:57 -0700
From: "Paul E. McKenney"
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Boqun Feng,
	Andrew Hunter, Maged Michael, gromer@google.com, Avi Kivity
Subject: Re: [RFC PATCH v2] membarrier: expedited private command
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20170727211314.32666-1-mathieu.desnoyers@efficios.com>
References: <20170727211314.32666-1-mathieu.desnoyers@efficios.com>
Message-Id: <20170727221357.GS3730@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
> from all runqueues for which current thread's mm is the same as the
> thread calling sys_membarrier.
> 
> Scheduler-wise, it requires that we add a memory barrier after context
> switching between processes (which have different mm). Interestingly,
> there is already a memory barrier in mmdrop(), so we only need to add
> a barrier when switching from a kernel thread to a userspace thread.
> We also don't need to add the barrier when switching to a kernel thread,
> because it has no userspace memory mapping, which makes ordering of
> user-space memory accesses pretty much useless.
> 
> * Benchmark
> 
> A stress-test benchmark of sched pipe shows that it does not add
> significant overhead to the scheduler switching between processes:
> 
> 100 runs of:
> 
> taskset 01 ./perf bench sched pipe
> 
> Running 'sched/pipe' benchmark:
> Executed 1000000 pipe operations between two processes
> 
> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> 
> A) With 4.13.0-rc2+
>    at commit a97fb594bc7d ("virtio-net: fix module unloading")
> 
> avg.:      2.923 usecs/op
> std.dev:   0.057 usecs/op
> 
> B) With this commit:
> 
> avg.:      2.916 usecs/op
> std.dev:   0.043 usecs/op
> 
> Changes since v1:
> - move membarrier code under kernel/sched/ because it uses the
>   scheduler runqueue,
> - only add the barrier when we switch from a kernel thread. The case
>   where we switch from a user-space thread is already handled by
>   the atomic_dec_and_test() in mmdrop().
> - add a comment to mmdrop() documenting the requirement on the implicit
>   memory barrier.
> 
> CC: Peter Zijlstra
> CC: Paul E. McKenney
> CC: Boqun Feng
> CC: Andrew Hunter
> CC: Maged Michael
> CC: gromer@google.com
> CC: Avi Kivity
> Signed-off-by: Mathieu Desnoyers

Looks much better, thank you!  I have queued this in place of my earlier
patch for the moment.

If there are no objections, I will push this into the upcoming v4.14
merge window.  If someone else wants to push it into v4.14, I am of
course fine with that, and am happy to give it a reviewed-by along the
way.

But if there are objections to your patch (suitably modified based on
additional review and testing, of course) going into v4.14, I can always
fall back to pushing my earlier simpler but less housebroken patch.  ;-)

							Thanx, Paul
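As background for readers less familiar with sys_membarrier(): the
user-space pattern the new private expedited command targets looks
roughly like the sketch below.  Fast-path code replaces a CPU memory
fence with a plain compiler barrier, and a rarely executed slow path
calls membarrier(), which acts as if every running thread of the calling
process had issued a full memory barrier.  The sketch is illustrative
only and not part of the patch: the helper and variable names are
invented, it assumes UAPI headers from a kernel carrying this change,
and the raw syscall is used because glibc provides no membarrier()
wrapper.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Hypothetical wrapper: glibc has no membarrier(), use the raw syscall. */
static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

static int data;
static int ready;

/*
 * Fast path, executed often and by many threads: program order plus a
 * compiler barrier stand in for a full CPU fence.
 */
static void fast_path(void)
{
	__atomic_store_n(&data, 42, __ATOMIC_RELAXED);
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier only */
	__atomic_store_n(&ready, 1, __ATOMIC_RELAXED);
}

/*
 * Slow path, executed rarely: once this returns, every running thread
 * of the process has passed through a point equivalent to a full memory
 * barrier, which is what lets the fast path drop its fence.
 */
static int slow_path_fence(void)
{
	return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}

The win is that the common fast path pays only a compiler barrier, while
the cost of the full fence is pushed onto the infrequent slow path.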
> ---
>  MAINTAINERS                     |  2 +-
>  include/linux/sched/mm.h        |  5 +++
>  include/uapi/linux/membarrier.h | 23 +++++++++++--
>  kernel/Makefile                 |  1 -
>  kernel/sched/Makefile           |  1 +
>  kernel/sched/core.c             | 27 ++++++++++++++++
>  kernel/{ => sched}/membarrier.c | 72 ++++++++++++++++++++++++++++++++++++++++-
>  7 files changed, 126 insertions(+), 5 deletions(-)
>  rename kernel/{ => sched}/membarrier.c (59%)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f66488dfdbc9..3b035584272f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8621,7 +8621,7 @@ M:	Mathieu Desnoyers
>  M:	"Paul E. McKenney"
>  L:	linux-kernel@vger.kernel.org
>  S:	Supported
> -F:	kernel/membarrier.c
> +F:	kernel/sched/membarrier.c
>  F:	include/uapi/linux/membarrier.h
> 
>  MEMORY MANAGEMENT
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 2b24a6974847..5c5384d9ae0f 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -38,6 +38,11 @@ static inline void mmgrab(struct mm_struct *mm)
>  extern void __mmdrop(struct mm_struct *);
>  static inline void mmdrop(struct mm_struct *mm)
>  {
> +	/*
> +	 * Implicit full memory barrier provided by
> +	 * atomic_dec_and_test() is required by membarrier. See comments
> +	 * around membarrier_expedited_mb_after_set_current().
> +	 */
>  	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
>  		__mmdrop(mm);
>  }
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index e0b108bd2624..6d47b3249d8a 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -40,14 +40,33 @@
>   *                          (non-running threads are de facto in such a
>   *                          state). This covers threads from all processes
>   *                          running on the system. This command returns 0.
> + * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> + *                          Execute a memory barrier on each running
> + *                          thread belonging to the same process as the current
> + *                          thread. Upon return from system call, the
> + *                          caller thread is ensured that all its running
> + *                          thread siblings have passed through a state
> + *                          where all memory accesses to user-space
> + *                          addresses match program order between entry
> + *                          to and return from the system call
> + *                          (non-running threads are de facto in such a
> + *                          state). This only covers threads from the
> + *                          same process as the caller thread. This
> + *                          command returns 0. The "expedited" commands
> + *                          complete faster than the non-expedited ones,
> + *                          they never block, but have the downside of
> + *                          causing extra overhead.
>   *
>   * Command to be passed to the membarrier system call. The commands need to
>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>   * the value 0.
>   */
>  enum membarrier_cmd {
> -	MEMBARRIER_CMD_QUERY = 0,
> -	MEMBARRIER_CMD_SHARED = (1 << 0),
> +	MEMBARRIER_CMD_QUERY			= 0,
> +	MEMBARRIER_CMD_SHARED			= (1 << 0),
> +	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> +	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> +	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>  };
> 
>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
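Since each command occupies a single bit and MEMBARRIER_CMD_QUERY
returns the bitmask of commands the running kernel supports, user space
would normally probe before relying on the new command.  A rough sketch
of such a probe follows; the helper name is hypothetical, and it again
assumes UAPI headers that include this patch.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Issue a process-wide barrier, preferring the new expedited command. */
static int process_wide_mb(void)
{
	int mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0)
		return mask;	/* sys_membarrier() not available at all */
	if (mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)	/* fast: IPIs, never blocks */
		return syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	return syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);
}

Falling back to MEMBARRIER_CMD_SHARED keeps older kernels working, at
the cost of possibly blocking in synchronize_sched().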
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4cb8e8b23c6e..9c323a6daa46 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>  obj-$(CONFIG_TORTURE_TEST) += torture.o
> -obj-$(CONFIG_MEMBARRIER) += membarrier.o
> 
>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
> 
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 53f0164ed362..78f54932ea1d 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o
>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>  obj-$(CONFIG_CPU_FREQ) += cpufreq.o
>  obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 17c667b427b4..01e3b881ab3a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2724,6 +2724,32 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>  		put_user(task_pid_vnr(current), current->set_child_tid);
>  }
> 
> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> +		struct mm_struct *oldmm)
> +{
> +	if (!IS_ENABLED(CONFIG_MEMBARRIER))
> +		return;
> +	/*
> +	 * __schedule()->
> +	 *   finish_task_switch()->
> +	 *     if (mm)
> +	 *       mmdrop(mm) ->
> +	 *         atomic_dec_and_test()
> +	 * takes care of issuing a memory barrier when oldmm is
> +	 * non-NULL. We also don't need the barrier when switching to a
> +	 * kernel thread, nor when we switch between threads belonging
> +	 * to the same process.
> +	 */
> +	if (likely(oldmm || !mm || mm == oldmm))
> +		return;
> +	/*
> +	 * When switching between processes, membarrier expedited
> +	 * private requires a memory barrier after we set the current
> +	 * task.
> +	 */
> +	smp_mb();
> +}
> +
>  /*
>   * context_switch - switch to the new MM and the new thread's register state.
>   */
> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> 
>  	mm = next->mm;
>  	oldmm = prev->active_mm;
> +	membarrier_expedited_mb_after_set_current(mm, oldmm);
>  	/*
>  	 * For paravirt, this is coupled with an exit in switch_to to
>  	 * combine the page table reload and the switch backend into
> diff --git a/kernel/membarrier.c b/kernel/sched/membarrier.c
> similarity index 59%
> rename from kernel/membarrier.c
> rename to kernel/sched/membarrier.c
> index 9f9284f37f8d..f80828b0b607 100644
> --- a/kernel/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -17,12 +17,79 @@
>  #include
>  #include
>  #include
> +#include
> +
> +#include "sched.h"	/* for cpu_rq(). */
> 
>  /*
>   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>   * except MEMBARRIER_CMD_QUERY.
>   */
> -#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> +#define MEMBARRIER_CMD_BITMASK	\
> +	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +
> +static void ipi_mb(void *info)
> +{
> +	smp_mb();	/* IPIs should be serializing but paranoid. */
> +}
> +
> +static void membarrier_private_expedited(void)
> +{
> +	int cpu, this_cpu;
> +	bool fallback = false;
> +	cpumask_var_t tmpmask;
> +
> +	if (num_online_cpus() == 1)
> +		return;
> +
> +	/*
> +	 * Matches memory barriers around rq->curr modification in
> +	 * scheduler.
> +	 */
> +	smp_mb();	/* system call entry is not a mb. */
> +
> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +		/* Fallback for OOM. */
> +		fallback = true;
> +	}
> +
> +	/*
> +	 * Skipping the current CPU is OK even though we can be
> +	 * migrated at any point. The current CPU, at the point where we
> +	 * read raw_smp_processor_id(), is ensured to be in program
> +	 * order with respect to the caller thread. Therefore, we can
> +	 * skip this CPU from the iteration.
> +	 */
> +	this_cpu = raw_smp_processor_id();
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		struct task_struct *p;
> +
> +		if (cpu == this_cpu)
> +			continue;
> +		rcu_read_lock();
> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		if (p && p->mm == current->mm) {
> +			if (!fallback)
> +				__cpumask_set_cpu(cpu, tmpmask);
> +			else
> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
> +		}
> +		rcu_read_unlock();
> +	}
> +	cpus_read_unlock();
> +	if (!fallback) {
> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> +		free_cpumask_var(tmpmask);
> +	}
> +
> +	/*
> +	 * Memory barrier on the caller thread _after_ we finished
> +	 * waiting for the last IPI. Matches memory barriers around
> +	 * rq->curr modification in scheduler.
> +	 */
> +	smp_mb();	/* exit from system call is not a mb */
> +}
> 
>  /**
>   * sys_membarrier - issue memory barriers on a set of threads
> @@ -64,6 +131,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>  		if (num_online_cpus() > 1)
>  			synchronize_sched();
>  		return 0;
> +	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> +		membarrier_private_expedited();
> +		return 0;
>  	default:
>  		return -EINVAL;
>  	}
> -- 
> 2.11.0
> 