Date: Fri, 28 Jul 2017 15:36:09 +0000 (UTC)
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: "Paul E. McKenney", linux-kernel, Boqun Feng, Andrew Hunter,
    maged michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
    Paul Mackerras, Michael Ellerman
Subject: Re: [RFC PATCH v2] membarrier: expedited private command
Message-ID: <624649192.29564.1501256169885.JavaMail.zimbra@efficios.com>
In-Reply-To: <20170728085532.ylhuz2irwmgpmejv@hirez.programming.kicks-ass.net>
References: <20170727211314.32666-1-mathieu.desnoyers@efficios.com>
            <20170728085532.ylhuz2irwmgpmejv@hirez.programming.kicks-ass.net>

----- On Jul 28, 2017, at 4:55 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> +                struct mm_struct *oldmm)
> 
> That is a bit of a mouth-full...
> 
>> +{
>> +        if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> +                return;
>> +        /*
>> +         * __schedule()
>> +         *   finish_task_switch()
>> +         *     if (mm)
>> +         *       mmdrop(mm)
>> +         *         atomic_dec_and_test()
>> +         *
>> +         * takes care of issuing a memory barrier when oldmm is
>> +         * non-NULL. We also don't need the barrier when switching to a
>> +         * kernel thread, nor when we switch between threads belonging
>> +         * to the same process.
>> +         */
>> +        if (likely(oldmm || !mm || mm == oldmm))
>> +                return;
>> +        /*
>> +         * When switching between processes, membarrier expedited
>> +         * private requires a memory barrier after we set the current
>> +         * task.
>> +         */
>> +        smp_mb();
>> +}
> 
> And because of what it complements, I would have expected the callsite:
> 
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>> 
>>          mm = next->mm;
>>          oldmm = prev->active_mm;
>> +        membarrier_expedited_mb_after_set_current(mm, oldmm);
>>          /*
>>           * For paravirt, this is coupled with an exit in switch_to to
>>           * combine the page table reload and the switch backend into
> 
> to be in finish_task_switch(), something like:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>          finish_arch_post_lock_switch();
>  
>          fire_sched_in_preempt_notifiers(current);
> +
> +        /*
> +         * For CONFIG_MEMBARRIER we need a full memory barrier after the
> +         * rq->curr assignment. Not all architectures have one in either
> +         * switch_to() or switch_mm() so we use (and complement) the one
> +         * implied by mmdrop()'s atomic_dec_and_test().
> +         */
>          if (mm)
>                  mmdrop(mm);
> +        else if (IS_ENABLED(CONFIG_MEMBARRIER))
> +                smp_mb();
> +
>          if (unlikely(prev_state == TASK_DEAD)) {
>                  if (prev->sched_class->task_dead)
>                          prev->sched_class->task_dead(prev);
> 
> 
> I realize this is sub-optimal if we're switching to a kernel thread, so
> it might want some work, then again, a whole bunch of architectures
> don't in fact need this extra barrier at all.

As discussed on IRC, I plan to go instead for:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01e3b881ab3a..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
         vtime_task_switch(prev);
         perf_event_task_sched_in(prev, current);
         finish_lock_switch(rq, prev);
+        /*
+         * The membarrier system call requires a full memory barrier
+         * after storing to rq->curr, before going back to user-space.
+         */
+        smp_mb__after_unlock_lock();
         finish_arch_post_lock_switch();
 
         fire_sched_in_preempt_notifiers(current);

Which is free on most architectures, except those defining
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE. CCing PPC maintainers.
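For the record, the reason it is free elsewhere: smp_mb__after_unlock_lock()
only expands to a real barrier on architectures that declare that
unlock+lock is not already a full barrier. From memory, the definition looks
something like the sketch below (it currently lives in an RCU-private
header, so making it visible to the scheduler is a detail to sort out in the
actual patch):

/*
 * Sketch (from memory, not the authoritative definition): full barrier
 * only where UNLOCK followed by LOCK does not already imply one, i.e. on
 * architectures selecting ARCH_WEAK_RELEASE_ACQUIRE (powerpc); a no-op
 * everywhere else.
 */
#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE
#define smp_mb__after_unlock_lock()     smp_mb()
#else
#define smp_mb__after_unlock_lock()     do { } while (0)
#endif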
> 
>> +static void membarrier_private_expedited(void)
>> +{
>> +        int cpu, this_cpu;
>> +        bool fallback = false;
>> +        cpumask_var_t tmpmask;
>> +
>> +        if (num_online_cpus() == 1)
>> +                return;
>> +
>> +        /*
>> +         * Matches memory barriers around rq->curr modification in
>> +         * scheduler.
>> +         */
>> +        smp_mb();        /* system call entry is not a mb. */
>> +
> 
> Weren't you going to put in a comment on that GFP_NOWAIT thing?

I only added it to the uapi header. Adding this to the implementation too:

+        /*
+         * Expedited membarrier commands guarantee that they won't
+         * block, hence the GFP_NOWAIT allocation flag and fallback
+         * implementation.
+         */

> 
>> +        if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> 
> You really want: zalloc_cpumask_var().

ok

> 
>> +                /* Fallback for OOM. */
>> +                fallback = true;
>> +        }
>> +
>> +        /*
>> +         * Skipping the current CPU is OK even though we can be
>> +         * migrated at any point. The current CPU, at the point where we
>> +         * read raw_smp_processor_id(), is ensured to be in program
>> +         * order with respect to the caller thread. Therefore, we can
>> +         * skip this CPU from the iteration.
>> +         */
>> +        this_cpu = raw_smp_processor_id();
> 
> So if instead you do the below, that is still true, but you have the
> opportunity to skip moar CPUs, then again, if you migrate the wrong way
> you'll end up not skipping yourself... ah well.

Chances are better to skip more CPUs in the face of migration if we do it
in the loop as you suggest. Will do.

> 
>> +        cpus_read_lock();
>> +        for_each_online_cpu(cpu) {
>> +                struct task_struct *p;
>> +
>                 if (cpu == raw_smp_processor_id())
>                         continue;
> 
>> +                rcu_read_lock();
>> +                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +                if (p && p->mm == current->mm) {
>> +                        if (!fallback)
>> +                                __cpumask_set_cpu(cpu, tmpmask);
>> +                        else
>> +                                smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> +                }
>> +                rcu_read_unlock();
>> +        }
>> +        cpus_read_unlock();
> 
> This ^, wants to go after that v
> 
>> +        if (!fallback) {
>> +                smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> +                free_cpumask_var(tmpmask);
>> +        }
> 
> Because otherwise the bits in your tmpmask might no longer match the
> online state.

Good point, thanks!
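So we are looking at the same end result, here is roughly what I have in
mind for the function with all of the above folded in: zalloc_cpumask_var()
plus the GFP_NOWAIT comment, the current-CPU check moved inside the loop,
and cpus_read_unlock() moved after the IPIs are sent. Untested sketch only,
it may still move around when I respin; ipi_mb() is just the trivial
smp_mb() wrapper from the patch:

static void ipi_mb(void *info)
{
        smp_mb();       /* IPIs should be serializing but paranoid. */
}

static void membarrier_private_expedited(void)
{
        int cpu;
        bool fallback = false;
        cpumask_var_t tmpmask;

        if (num_online_cpus() == 1)
                return;

        /*
         * Matches memory barriers around rq->curr modification in
         * scheduler.
         */
        smp_mb();       /* system call entry is not a mb. */

        /*
         * Expedited membarrier commands guarantee that they won't
         * block, hence the GFP_NOWAIT allocation flag and fallback
         * implementation.
         */
        if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
                /* Fallback for OOM. */
                fallback = true;
        }

        cpus_read_lock();
        for_each_online_cpu(cpu) {
                struct task_struct *p;

                /*
                 * Skipping the current CPU is OK even though we can be
                 * migrated at any point. The current CPU, at the point
                 * where we read raw_smp_processor_id(), is ensured to
                 * be in program order with respect to the caller
                 * thread. Therefore, we can skip this CPU from the
                 * iteration.
                 */
                if (cpu == raw_smp_processor_id())
                        continue;
                rcu_read_lock();
                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
                if (p && p->mm == current->mm) {
                        if (!fallback)
                                __cpumask_set_cpu(cpu, tmpmask);
                        else
                                smp_call_function_single(cpu, ipi_mb, NULL, 1);
                }
                rcu_read_unlock();
        }
        if (!fallback) {
                smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
                free_cpumask_var(tmpmask);
        }
        cpus_read_unlock();

        /*
         * Memory barrier on the caller thread _after_ we finished
         * waiting for the last IPI. Matches memory barriers around
         * rq->curr modification in scheduler.
         */
        smp_mb();       /* exit from system call is not a mb */
}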
Mathieu

> 
>> +
>> +        /*
>> +         * Memory barrier on the caller thread _after_ we finished
>> +         * waiting for the last IPI. Matches memory barriers around
>> +         * rq->curr modification in scheduler.
>> +         */
>> +        smp_mb();        /* exit from system call is not a mb */
>> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com