Date: Fri, 28 Jul 2017 15:36:09 +0000 (UTC)
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: "Paul E. McKenney", linux-kernel, Boqun Feng, Andrew Hunter,
    maged michael, gromer, Avi Kivity, Benjamin Herrenschmidt,
    Paul Mackerras, Michael Ellerman
Subject: Re: [RFC PATCH v2] membarrier: expedited private command
Message-ID: <624649192.29564.1501256169885.JavaMail.zimbra@efficios.com>
In-Reply-To: <20170728085532.ylhuz2irwmgpmejv@hirez.programming.kicks-ass.net>
References: <20170727211314.32666-1-mathieu.desnoyers@efficios.com>
            <20170728085532.ylhuz2irwmgpmejv@hirez.programming.kicks-ass.net>

----- On Jul 28, 2017, at 4:55 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> +                struct mm_struct *oldmm)
> 
> That is a bit of a mouth-full...
> 
>> +{
>> +        if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> +                return;
>> +        /*
>> +         * __schedule()
>> +         *   finish_task_switch()
>> +         *     if (mm)
>> +         *       mmdrop(mm)
>> +         *         atomic_dec_and_test()
>> +         *
>> +         * takes care of issuing a memory barrier when oldmm is
>> +         * non-NULL. We also don't need the barrier when switching to a
>> +         * kernel thread, nor when we switch between threads belonging
>> +         * to the same process.
>> +         */
>> +        if (likely(oldmm || !mm || mm == oldmm))
>> +                return;
>> +        /*
>> +         * When switching between processes, membarrier expedited
>> +         * private requires a memory barrier after we set the current
>> +         * task.
>> +         */
>> +        smp_mb();
>> +}
> 
> And because of what it complements, I would have expected the callsite:
> 
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>> 
>>          mm = next->mm;
>>          oldmm = prev->active_mm;
>> +        membarrier_expedited_mb_after_set_current(mm, oldmm);
>>          /*
>>           * For paravirt, this is coupled with an exit in switch_to to
>>           * combine the page table reload and the switch backend into
> 
> to be in finish_task_switch(), something like:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>          finish_arch_post_lock_switch();
>  
>          fire_sched_in_preempt_notifiers(current);
> +
> +        /*
> +         * For CONFIG_MEMBARRIER we need a full memory barrier after the
> +         * rq->curr assignment. Not all architectures have one in either
> +         * switch_to() or switch_mm() so we use (and complement) the one
> +         * implied by mmdrop()'s atomic_dec_and_test().
> +         */
>          if (mm)
>                  mmdrop(mm);
> +        else if (IS_ENABLED(CONFIG_MEMBARRIER))
> +                smp_mb();
> +
>          if (unlikely(prev_state == TASK_DEAD)) {
>                  if (prev->sched_class->task_dead)
>                          prev->sched_class->task_dead(prev);
> 
> 
> I realize this is sub-optimal if we're switching to a kernel thread, so
> it might want some work, then again, a whole bunch of architectures
> don't in fact need this extra barrier at all.

As discussed on IRC, I plan to go instead for:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01e3b881ab3a..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
         vtime_task_switch(prev);
         perf_event_task_sched_in(prev, current);
         finish_lock_switch(rq, prev);
+        /*
+         * The membarrier system call requires a full memory barrier
+         * after storing to rq->curr, before going back to user-space.
+         */
+        smp_mb__after_unlock_lock();
         finish_arch_post_lock_switch();
 
         fire_sched_in_preempt_notifiers(current);

Which is free on most architectures, except those defining
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE. CCing PPC maintainers.
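For the record, the reason it is free elsewhere: smp_mb__after_unlock_lock()
only expands to a real barrier on architectures that declare that
unlock+lock is not already a full barrier. From memory, the definition looks
something like the sketch below (it currently lives in an RCU-private
header, so making it visible to the scheduler is a detail to sort out in the
actual patch):

/*
 * Sketch (from memory, not the authoritative definition): full barrier
 * only where UNLOCK followed by LOCK does not already imply one, i.e. on
 * architectures selecting ARCH_WEAK_RELEASE_ACQUIRE (powerpc); a no-op
 * everywhere else.
 */
#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE
#define smp_mb__after_unlock_lock()     smp_mb()
#else
#define smp_mb__after_unlock_lock()     do { } while (0)
#endif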
> 
>> +static void membarrier_private_expedited(void)
>> +{
>> +        int cpu, this_cpu;
>> +        bool fallback = false;
>> +        cpumask_var_t tmpmask;
>> +
>> +        if (num_online_cpus() == 1)
>> +                return;
>> +
>> +        /*
>> +         * Matches memory barriers around rq->curr modification in
>> +         * scheduler.
>> +         */
>> +        smp_mb();        /* system call entry is not a mb. */
>> +
> 
> Weren't you going to put in a comment on that GFP_NOWAIT thing?

I only added it to the uapi header. Adding this to the implementation too:

+        /*
+         * Expedited membarrier commands guarantee that they won't
+         * block, hence the GFP_NOWAIT allocation flag and fallback
+         * implementation.
+         */

> 
>> +        if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> 
> You really want: zalloc_cpumask_var().

ok

> 
>> +                /* Fallback for OOM. */
>> +                fallback = true;
>> +        }
>> +
>> +        /*
>> +         * Skipping the current CPU is OK even though we can be
>> +         * migrated at any point. The current CPU, at the point where we
>> +         * read raw_smp_processor_id(), is ensured to be in program
>> +         * order with respect to the caller thread. Therefore, we can
>> +         * skip this CPU from the iteration.
>> +         */
>> +        this_cpu = raw_smp_processor_id();
> 
> So if instead you do the below, that is still true, but you have the
> opportunity to skip moar CPUs, then again, if you migrate the wrong way
> you'll end up not skipping yourself... ah well.

Chances are better to skip more CPUs in the face of migration if we do it
in the loop as you suggest. Will do.

> 
>> +        cpus_read_lock();
>> +        for_each_online_cpu(cpu) {
>> +                struct task_struct *p;
>> +
>                 if (cpu == raw_smp_processor_id())
>                         continue;
> 
>> +                rcu_read_lock();
>> +                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +                if (p && p->mm == current->mm) {
>> +                        if (!fallback)
>> +                                __cpumask_set_cpu(cpu, tmpmask);
>> +                        else
>> +                                smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> +                }
>> +                rcu_read_unlock();
>> +        }
>> +        cpus_read_unlock();
> 
> This ^, wants to go after that v
> 
>> +        if (!fallback) {
>> +                smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> +                free_cpumask_var(tmpmask);
>> +        }
> 
> Because otherwise the bits in your tmpmask might no longer match the
> online state.

Good point, thanks!
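So we are looking at the same end result, here is roughly what I have in
mind for the function with all of the above folded in: zalloc_cpumask_var()
plus the GFP_NOWAIT comment, the current-CPU check moved inside the loop,
and cpus_read_unlock() moved after the IPIs are sent. Untested sketch only,
it may still move around when I respin; ipi_mb() is just the trivial
smp_mb() wrapper from the patch:

static void ipi_mb(void *info)
{
        smp_mb();       /* IPIs should be serializing but paranoid. */
}

static void membarrier_private_expedited(void)
{
        int cpu;
        bool fallback = false;
        cpumask_var_t tmpmask;

        if (num_online_cpus() == 1)
                return;

        /*
         * Matches memory barriers around rq->curr modification in
         * scheduler.
         */
        smp_mb();       /* system call entry is not a mb. */

        /*
         * Expedited membarrier commands guarantee that they won't
         * block, hence the GFP_NOWAIT allocation flag and fallback
         * implementation.
         */
        if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
                /* Fallback for OOM. */
                fallback = true;
        }

        cpus_read_lock();
        for_each_online_cpu(cpu) {
                struct task_struct *p;

                /*
                 * Skipping the current CPU is OK even though we can be
                 * migrated at any point. The current CPU, at the point
                 * where we read raw_smp_processor_id(), is ensured to
                 * be in program order with respect to the caller
                 * thread. Therefore, we can skip this CPU from the
                 * iteration.
                 */
                if (cpu == raw_smp_processor_id())
                        continue;
                rcu_read_lock();
                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
                if (p && p->mm == current->mm) {
                        if (!fallback)
                                __cpumask_set_cpu(cpu, tmpmask);
                        else
                                smp_call_function_single(cpu, ipi_mb, NULL, 1);
                }
                rcu_read_unlock();
        }
        if (!fallback) {
                smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
                free_cpumask_var(tmpmask);
        }
        cpus_read_unlock();

        /*
         * Memory barrier on the caller thread _after_ we finished
         * waiting for the last IPI. Matches memory barriers around
         * rq->curr modification in scheduler.
         */
        smp_mb();       /* exit from system call is not a mb */
}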
Mathieu

> 
>> +
>> +        /*
>> +         * Memory barrier on the caller thread _after_ we finished
>> +         * waiting for the last IPI. Matches memory barriers around
>> +         * rq->curr modification in scheduler.
>> +         */
>> +        smp_mb();        /* exit from system call is not a mb */
>> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com