Date: Thu, 27 Jul 2017 22:41:25 +0000 (UTC)
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
        linux-kernel <linux-kernel@vger.kernel.org>,
        Boqun Feng <boqun.feng@gmail.com>, Andrew Hunter <ahh@google.com>,
        maged michael <maged.michael@gmail.com>, gromer <gromer@google.com>,
        Avi Kivity <avi@scylladb.com>
Message-ID: <1653464831.29191.1501195285642.JavaMail.zimbra@efficios.com>
In-Reply-To: <20170727221357.GS3730@linux.vnet.ibm.com>
References: <20170727211314.32666-1-mathieu.desnoyers@efficios.com> <20170727221357.GS3730@linux.vnet.ibm.com>
Subject: Re: [RFC PATCH v2] membarrier: expedited private command
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Thread-Topic: membarrier: expedited private command
Thread-Index: bV16GxKAiUHSrABRiTJ9KNslltKsSg==
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 12195
Lines: 332

----- On Jul 27, 2017, at 6:13 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
>> from all runqueues for which current thread's mm is the same as the
>> thread calling sys_membarrier.
>> 
>> Scheduler-wise, it requires that we add a memory barrier after context
>> switching between processes (which have different mm). Interestingly,
>> there is already a memory barrier in mmdrop(), so we only need to add
>> a barrier when switching from a kernel thread to a userspace thread.
>> We also don't need to add the barrier when switching to a kernel thread,
>> because it has no userspace memory mapping, which makes ordering of
>> user-space memory accesses pretty much useless.
>> 
>> * Benchmark
>> 
>> A stress-test benchmark of sched pipe shows that it does not add
>> significant overhead to the scheduler switching between processes:
>> 
>> 100 runs of:
>> 
>> taskset 01 ./perf bench sched pipe
>> 
>> Running 'sched/pipe' benchmark:
>> Executed 1000000 pipe operations between two processes
>> 
>> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> 
>> A) With 4.13.0-rc2+
>>    at commit a97fb594bc7d ("virtio-net: fix module unloading")
>> 
>> avg.:     2.923 usecs/op
>> std.dev:  0.057 usecs/op
>> 
>> B) With this commit:
>> 
>> avg.:     2.916 usecs/op
>> std.dev:  0.043 usecs/op
>> 
>> Changes since v1:
>> - move membarrier code under kernel/sched/ because it uses the
>>   scheduler runqueue,
>> - only add the barrier when we switch from a kernel thread. The case
>>   where we switch from a user-space thread is already handled by
>>   the atomic_dec_and_test() in mmdrop().
>> - add a comment to mmdrop() documenting the requirement on the implicit
>>   memory barrier.
>> 
>> CC: Peter Zijlstra <peterz@infradead.org>
>> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> CC: Boqun Feng <boqun.feng@gmail.com>
>> CC: Andrew Hunter <ahh@google.com>
>> CC: Maged Michael <maged.michael@gmail.com>
>> CC: gromer@google.com
>> CC: Avi Kivity <avi@scylladb.com>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 
> Looks much better, thank you!
> 
> I have queued this in place of my earlier patch for the moment.  If there
> are no objections, I will push this into the upcoming v4.14 merge window.
> If someone else wants to push it into v4.14, I am of course fine with
> that, and am happy to give it a reviewed-by along the way.  But if there
> are objections to your patch (suitably modified based on additional
> review and testing, of course) going into v4.14, I can always fall back
> to pushing my earlier simpler but less housebroken patch.  ;-)

I'm fine about you picking up this patch, even though it's tagged "RFC".
I'm sure concerns will have plenty of time to be voiced by others until
it reaches mainline anyway, at which point I'll address them and resubmit
new versions.

Thanks!

Mathieu

> 
>							Thanx, Paul
> 
>> ---
>>  MAINTAINERS                     |  2 +-
>>  include/linux/sched/mm.h        |  5 +++
>>  include/uapi/linux/membarrier.h | 23 +++++++++++--
>>  kernel/Makefile                 |  1 -
>>  kernel/sched/Makefile           |  1 +
>>  kernel/sched/core.c             | 27 ++++++++++++++++
>>  kernel/{ => sched}/membarrier.c | 72 ++++++++++++++++++++++++++++++++++++++++-
>>  7 files changed, 126 insertions(+), 5 deletions(-)
>>  rename kernel/{ => sched}/membarrier.c (59%)
>> 
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index f66488dfdbc9..3b035584272f 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -8621,7 +8621,7 @@ M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>  M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>>  L:	linux-kernel@vger.kernel.org
>>  S:	Supported
>> -F:	kernel/membarrier.c
>> +F:	kernel/sched/membarrier.c
>>  F:	include/uapi/linux/membarrier.h
>> 
>>  MEMORY MANAGEMENT
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 2b24a6974847..5c5384d9ae0f 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -38,6 +38,11 @@ static inline void mmgrab(struct mm_struct *mm)
>>  extern void __mmdrop(struct mm_struct *);
>>  static inline void mmdrop(struct mm_struct *mm)
>>  {
>> +	/*
>> +	 * Implicit full memory barrier provided by
>> +	 * atomic_dec_and_test() is required by membarrier. See comments
>> +	 * around membarrier_expedited_mb_after_set_current().
>> +	 */
>>  	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
>>  		__mmdrop(mm);
>>  }
>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>> index e0b108bd2624..6d47b3249d8a 100644
>> --- a/include/uapi/linux/membarrier.h
>> +++ b/include/uapi/linux/membarrier.h
>> @@ -40,14 +40,33 @@
>>   *                          (non-running threads are de facto in such a
>>   *                          state). This covers threads from all processes
>>   *                          running on the system. This command returns 0.
>> + * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>> + *                          Execute a memory barrier on each running
>> + *                          thread belonging to the same process as the current
>> + *                          thread. Upon return from system call, the
>> + *                          caller thread is ensured that all its running
>> + *                          threads siblings have passed through a state
>> + *                          where all memory accesses to user-space
>> + *                          addresses match program order between entry
>> + *                          to and return from the system call
>> + *                          (non-running threads are de facto in such a
>> + *                          state). This only covers threads from the
>> + *                          same processes as the caller thread. This
>> + *                          command returns 0. The "expedited" commands
>> + *                          complete faster than the non-expedited ones,
>> + *                          they never block, but have the downside of
>> + *                          causing extra overhead.
>>   *
>>   * Command to be passed to the membarrier system call. The commands need to
>>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>>   * the value 0.
>>   */
>>  enum membarrier_cmd {
>> -	MEMBARRIER_CMD_QUERY = 0,
>> -	MEMBARRIER_CMD_SHARED = (1 << 0),
>> +	MEMBARRIER_CMD_QUERY			= 0,
>> +	MEMBARRIER_CMD_SHARED			= (1 << 0),
>> +	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
>> +	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
>> +	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>>  };
>> 
>>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 4cb8e8b23c6e..9c323a6daa46 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>> -obj-$(CONFIG_MEMBARRIER) += membarrier.o
>> 
>>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
>> 
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index 53f0164ed362..78f54932ea1d 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o
>>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>>  obj-$(CONFIG_CPU_FREQ) += cpufreq.o
>>  obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 17c667b427b4..01e3b881ab3a 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2724,6 +2724,32 @@ asmlinkage __visible void schedule_tail(struct
>> task_struct *prev)
>>  		put_user(task_pid_vnr(current), current->set_child_tid);
>>  }
>> 
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> +		struct mm_struct *oldmm)
>> +{
>> +	if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> +		return;
>> +	/*
>> +	 * __schedule()->
>> +	 *   finish_task_switch()->
>> +	 *    if (mm)
>> +	 *      mmdrop(mm) ->
>> +	 *        atomic_dec_and_test()
>> +	 * takes care of issuing a memory barrier when oldmm is
>> +	 * non-NULL. We also don't need the barrier when switching to a
>> +	 * kernel thread, nor when we switch between threads belonging
>> +	 * to the same process.
>> +	 */
>> +	if (likely(oldmm || !mm || mm == oldmm))
>> +		return;
>> +	/*
>> +	 * When switching between processes, membarrier expedited
>> +	 * private requires a memory barrier after we set the current
>> +	 * task.
>> +	 */
>> +	smp_mb();
>> +}
>> +
>>  /*
>>   * context_switch - switch to the new MM and the new thread's register state.
>>   */
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>> 
>>  	mm = next->mm;
>>  	oldmm = prev->active_mm;
>> +	membarrier_expedited_mb_after_set_current(mm, oldmm);
>>  	/*
>>  	 * For paravirt, this is coupled with an exit in switch_to to
>>  	 * combine the page table reload and the switch backend into
>> diff --git a/kernel/membarrier.c b/kernel/sched/membarrier.c
>> similarity index 59%
>> rename from kernel/membarrier.c
>> rename to kernel/sched/membarrier.c
>> index 9f9284f37f8d..f80828b0b607 100644
>> --- a/kernel/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -17,12 +17,79 @@
>>  #include <linux/syscalls.h>
>>  #include <linux/membarrier.h>
>>  #include <linux/tick.h>
>> +#include <linux/cpumask.h>
>> +
>> +#include "sched.h"	/* for cpu_rq(). */
>> 
>>  /*
>>   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>>   * except MEMBARRIER_CMD_QUERY.
>>   */
>> -#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>> +#define MEMBARRIER_CMD_BITMASK	\
>> +	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
>> +
>> +static void ipi_mb(void *info)
>> +{
>> +	smp_mb();	/* IPIs should be serializing but paranoid. */
>> +}
>> +
>> +static void membarrier_private_expedited(void)
>> +{
>> +	int cpu, this_cpu;
>> +	bool fallback = false;
>> +	cpumask_var_t tmpmask;
>> +
>> +	if (num_online_cpus() == 1)
>> +		return;
>> +
>> +	/*
>> +	 * Matches memory barriers around rq->curr modification in
>> +	 * scheduler.
>> +	 */
>> +	smp_mb();	/* system call entry is not a mb. */
>> +
>> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> +		/* Fallback for OOM. */
>> +		fallback = true;
>> +	}
>> +
>> +	/*
>> +	 * Skipping the current CPU is OK even through we can be
>> +	 * migrated at any point. The current CPU, at the point where we
>> +	 * read raw_smp_processor_id(), is ensured to be in program
>> +	 * order with respect to the caller thread. Therefore, we can
>> +	 * skip this CPU from the iteration.
>> +	 */
>> +	this_cpu = raw_smp_processor_id();
>> +	cpus_read_lock();
>> +	for_each_online_cpu(cpu) {
>> +		struct task_struct *p;
>> +
>> +		if (cpu == this_cpu)
>> +			continue;
>> +		rcu_read_lock();
>> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +		if (p && p->mm == current->mm) {
>> +			if (!fallback)
>> +				__cpumask_set_cpu(cpu, tmpmask);
>> +			else
>> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> +		}
>> +		rcu_read_unlock();
>> +	}
>> +	cpus_read_unlock();
>> +	if (!fallback) {
>> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> +		free_cpumask_var(tmpmask);
>> +	}
>> +
>> +	/*
>> +	 * Memory barrier on the caller thread _after_ we finished
>> +	 * waiting for the last IPI. Matches memory barriers around
>> +	 * rq->curr modification in scheduler.
>> +	 */
>> +	smp_mb();	/* exit from system call is not a mb */
>> +}
>> 
>>  /**
>>   * sys_membarrier - issue memory barriers on a set of threads
>> @@ -64,6 +131,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>>  		if (num_online_cpus() > 1)
>>  			synchronize_sched();
>>  		return 0;
>> +	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>> +		membarrier_private_expedited();
>> +		return 0;
>>  	default:
>>  		return -EINVAL;
>>  	}
>> --
>> 2.11.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com