Date: Thu, 27 Jul 2017 15:57:27 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
        linux-kernel <linux-kernel@vger.kernel.org>,
        Boqun Feng <boqun.feng@gmail.com>, Andrew Hunter <ahh@google.com>,
        maged michael <maged.michael@gmail.com>, gromer <gromer@google.com>,
        Avi Kivity <avi@scylladb.com>
Subject: Re: [RFC PATCH v2] membarrier: expedited private command
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170727211314.32666-1-mathieu.desnoyers@efficios.com>
 <20170727221357.GS3730@linux.vnet.ibm.com>
 <1653464831.29191.1501195285642.JavaMail.zimbra@efficios.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1653464831.29191.1501195285642.JavaMail.zimbra@efficios.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Message-Id: <20170727225727.GT3730@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 13063
Lines: 339

On Thu, Jul 27, 2017 at 10:41:25PM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 27, 2017, at 6:13 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:
> 
> > On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
> >> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask built
> >> from all runqueues whose current thread's mm is the same as that of the
> >> thread calling sys_membarrier.
> >> 
> >> Scheduler-wise, it requires that we add a memory barrier after context
> >> switching between processes (which have different mm). Interestingly,
> >> there is already a memory barrier in mmdrop(), so we only need to add
> >> a barrier when switching from a kernel thread to a userspace thread.
> >> We also don't need to add the barrier when switching to a kernel thread,
> >> because it has no userspace memory mapping, which makes ordering of
> >> user-space memory accesses pretty much useless.
> >> 
> >> * Benchmark
> >> 
> >> A stress-test benchmark of sched pipe shows that it does not add
> >> significant overhead to the scheduler switching between processes:
> >> 
> >> 100 runs of:
> >> 
> >> taskset 01 ./perf bench sched pipe
> >> 
> >> Running 'sched/pipe' benchmark:
> >> Executed 1000000 pipe operations between two processes
> >> 
> >> Hardware: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> >> 
> >> A) With 4.13.0-rc2+
> >>    at commit a97fb594bc7d ("virtio-net: fix module unloading")
> >> 
> >> avg.:     2.923 usecs/op
> >> std.dev:  0.057 usecs/op
> >> 
> >> B) With this commit:
> >> 
> >> avg.:     2.916 usecs/op
> >> std.dev:  0.043 usecs/op
> >> 
> >> Changes since v1:
> >> - move membarrier code under kernel/sched/ because it uses the
> >>   scheduler runqueue,
> >> - only add the barrier when we switch from a kernel thread. The case
> >>   where we switch from a user-space thread is already handled by
> >>   the atomic_dec_and_test() in mmdrop().
> >> - add a comment to mmdrop() documenting the requirement on the implicit
> >>   memory barrier.
> >> 
> >> CC: Peter Zijlstra <peterz@infradead.org>
> >> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> CC: Boqun Feng <boqun.feng@gmail.com>
> >> CC: Andrew Hunter <ahh@google.com>
> >> CC: Maged Michael <maged.michael@gmail.com>
> >> CC: gromer@google.com
> >> CC: Avi Kivity <avi@scylladb.com>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > 
> > Looks much better, thank you!
> > 
> > I have queued this in place of my earlier patch for the moment.  If there
> > are no objections, I will push this into the upcoming v4.14 merge window.
> > If someone else wants to push it into v4.14, I am of course fine with
> > that, and am happy to give it a reviewed-by along the way.  But if there
> > are objections to your patch (suitably modified based on additional
> > review and testing, of course) going into v4.14, I can always fall back
> > to pushing my earlier simpler but less housebroken patch.  ;-)
> 
> I'm fine with you picking up this patch, even though it's tagged "RFC".
> I'm sure others will have plenty of time to voice concerns before it
> reaches mainline anyway, at which point I'll address them and resubmit
> new versions.

Works for me!  My -rcu tree is subject to rebasing, so I can easily
replace the current patch with an updated one.

							Thanx, Paul

> Thanks!
> 
> Mathieu
> 
> > 
> >							Thanx, Paul
> > 
> >> ---
> >>  MAINTAINERS                     |  2 +-
> >>  include/linux/sched/mm.h        |  5 +++
> >>  include/uapi/linux/membarrier.h | 23 +++++++++++--
> >>  kernel/Makefile                 |  1 -
> >>  kernel/sched/Makefile           |  1 +
> >>  kernel/sched/core.c             | 27 ++++++++++++++++
> >>  kernel/{ => sched}/membarrier.c | 72 ++++++++++++++++++++++++++++++++++++++++-
> >>  7 files changed, 126 insertions(+), 5 deletions(-)
> >>  rename kernel/{ => sched}/membarrier.c (59%)
> >> 
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index f66488dfdbc9..3b035584272f 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -8621,7 +8621,7 @@ M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>  M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> >>  L:	linux-kernel@vger.kernel.org
> >>  S:	Supported
> >> -F:	kernel/membarrier.c
> >> +F:	kernel/sched/membarrier.c
> >>  F:	include/uapi/linux/membarrier.h
> >> 
> >>  MEMORY MANAGEMENT
> >> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> >> index 2b24a6974847..5c5384d9ae0f 100644
> >> --- a/include/linux/sched/mm.h
> >> +++ b/include/linux/sched/mm.h
> >> @@ -38,6 +38,11 @@ static inline void mmgrab(struct mm_struct *mm)
> >>  extern void __mmdrop(struct mm_struct *);
> >>  static inline void mmdrop(struct mm_struct *mm)
> >>  {
> >> +	/*
> >> +	 * Implicit full memory barrier provided by
> >> +	 * atomic_dec_and_test() is required by membarrier. See comments
> >> +	 * around membarrier_expedited_mb_after_set_current().
> >> +	 */
> >>  	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
> >>  		__mmdrop(mm);
> >>  }
> >> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> >> index e0b108bd2624..6d47b3249d8a 100644
> >> --- a/include/uapi/linux/membarrier.h
> >> +++ b/include/uapi/linux/membarrier.h
> >> @@ -40,14 +40,33 @@
> >>   *                          (non-running threads are de facto in such a
> >>   *                          state). This covers threads from all processes
> >>   *                          running on the system. This command returns 0.
> >> + * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> >> + *                          Execute a memory barrier on each running
> >> + *                          thread belonging to the same process as the current
> >> + *                          thread. Upon return from system call, the
> >> + *                          caller thread is guaranteed that all its running
> >> + *                          sibling threads have passed through a state
> >> + *                          where all memory accesses to user-space
> >> + *                          addresses match program order between entry
> >> + *                          to and return from the system call
> >> + *                          (non-running threads are de facto in such a
> >> + *                          state). This only covers threads from the
> >> + *                          same process as the caller thread. This
> >> + *                          command returns 0. The "expedited" commands
> >> + *                          complete faster than the non-expedited ones;
> >> + *                          they never block, but have the downside of
> >> + *                          causing extra overhead.
> >>   *
> >>   * Command to be passed to the membarrier system call. The commands need to
> >>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
> >>   * the value 0.
> >>   */
> >>  enum membarrier_cmd {
> >> -	MEMBARRIER_CMD_QUERY = 0,
> >> -	MEMBARRIER_CMD_SHARED = (1 << 0),
> >> +	MEMBARRIER_CMD_QUERY			= 0,
> >> +	MEMBARRIER_CMD_SHARED			= (1 << 0),
> >> +	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> >> +	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> >> +	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
> >>  };
> >> 
> >>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
> >> diff --git a/kernel/Makefile b/kernel/Makefile
> >> index 4cb8e8b23c6e..9c323a6daa46 100644
> >> --- a/kernel/Makefile
> >> +++ b/kernel/Makefile
> >> @@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> >>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> >>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
> >>  obj-$(CONFIG_TORTURE_TEST) += torture.o
> >> -obj-$(CONFIG_MEMBARRIER) += membarrier.o
> >> 
> >>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
> >> 
> >> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> >> index 53f0164ed362..78f54932ea1d 100644
> >> --- a/kernel/sched/Makefile
> >> +++ b/kernel/sched/Makefile
> >> @@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o
> >>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> >>  obj-$(CONFIG_CPU_FREQ) += cpufreq.o
> >>  obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
> >> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 17c667b427b4..01e3b881ab3a 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -2724,6 +2724,32 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> >>  		put_user(task_pid_vnr(current), current->set_child_tid);
> >>  }
> >> 
> >> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> >> +		struct mm_struct *oldmm)
> >> +{
> >> +	if (!IS_ENABLED(CONFIG_MEMBARRIER))
> >> +		return;
> >> +	/*
> >> +	 * __schedule()->
> >> +	 *   finish_task_switch()->
> >> +	 *    if (mm)
> >> +	 *      mmdrop(mm) ->
> >> +	 *        atomic_dec_and_test()
> >> +	 * takes care of issuing a memory barrier when oldmm is
> >> +	 * non-NULL. We also don't need the barrier when switching to a
> >> +	 * kernel thread, nor when we switch between threads belonging
> >> +	 * to the same process.
> >> +	 */
> >> +	if (likely(oldmm || !mm || mm == oldmm))
> >> +		return;
> >> +	/*
> >> +	 * When switching between processes, membarrier expedited
> >> +	 * private requires a memory barrier after we set the current
> >> +	 * task.
> >> +	 */
> >> +	smp_mb();
> >> +}
> >> +
> >>  /*
> >>   * context_switch - switch to the new MM and the new thread's register state.
> >>   */
> >> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >> 
> >>  	mm = next->mm;
> >>  	oldmm = prev->active_mm;
> >> +	membarrier_expedited_mb_after_set_current(mm, oldmm);
> >>  	/*
> >>  	 * For paravirt, this is coupled with an exit in switch_to to
> >>  	 * combine the page table reload and the switch backend into
> >> diff --git a/kernel/membarrier.c b/kernel/sched/membarrier.c
> >> similarity index 59%
> >> rename from kernel/membarrier.c
> >> rename to kernel/sched/membarrier.c
> >> index 9f9284f37f8d..f80828b0b607 100644
> >> --- a/kernel/membarrier.c
> >> +++ b/kernel/sched/membarrier.c
> >> @@ -17,12 +17,79 @@
> >>  #include <linux/syscalls.h>
> >>  #include <linux/membarrier.h>
> >>  #include <linux/tick.h>
> >> +#include <linux/cpumask.h>
> >> +
> >> +#include "sched.h"	/* for cpu_rq(). */
> >> 
> >>  /*
> >>   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> >>   * except MEMBARRIER_CMD_QUERY.
> >>   */
> >> -#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> >> +#define MEMBARRIER_CMD_BITMASK	\
> >> +	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> >> +
> >> +static void ipi_mb(void *info)
> >> +{
> >> +	smp_mb();	/* IPIs should be serializing but paranoid. */
> >> +}
> >> +
> >> +static void membarrier_private_expedited(void)
> >> +{
> >> +	int cpu, this_cpu;
> >> +	bool fallback = false;
> >> +	cpumask_var_t tmpmask;
> >> +
> >> +	if (num_online_cpus() == 1)
> >> +		return;
> >> +
> >> +	/*
> >> +	 * Matches memory barriers around rq->curr modification in
> >> +	 * scheduler.
> >> +	 */
> >> +	smp_mb();	/* system call entry is not a mb. */
> >> +
> >> +	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> >> +		/* Fallback for OOM. */
> >> +		fallback = true;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Skipping the current CPU is OK even though we can be
> >> +	 * migrated at any point. The current CPU, at the point where we
> >> +	 * read raw_smp_processor_id(), is ensured to be in program
> >> +	 * order with respect to the caller thread. Therefore, we can
> >> +	 * skip this CPU from the iteration.
> >> +	 */
> >> +	this_cpu = raw_smp_processor_id();
> >> +	cpus_read_lock();
> >> +	for_each_online_cpu(cpu) {
> >> +		struct task_struct *p;
> >> +
> >> +		if (cpu == this_cpu)
> >> +			continue;
> >> +		rcu_read_lock();
> >> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> >> +		if (p && p->mm == current->mm) {
> >> +			if (!fallback)
> >> +				__cpumask_set_cpu(cpu, tmpmask);
> >> +			else
> >> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
> >> +		}
> >> +		rcu_read_unlock();
> >> +	}
> >> +	cpus_read_unlock();
> >> +	if (!fallback) {
> >> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> >> +		free_cpumask_var(tmpmask);
> >> +	}
> >> +
> >> +	/*
> >> +	 * Memory barrier on the caller thread _after_ we finished
> >> +	 * waiting for the last IPI. Matches memory barriers around
> >> +	 * rq->curr modification in scheduler.
> >> +	 */
> >> +	smp_mb();	/* exit from system call is not a mb */
> >> +}
> >> 
> >>  /**
> >>   * sys_membarrier - issue memory barriers on a set of threads
> >> @@ -64,6 +131,9 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> >>  		if (num_online_cpus() > 1)
> >>  			synchronize_sched();
> >>  		return 0;
> >> +	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> >> +		membarrier_private_expedited();
> >> +		return 0;
> >>  	default:
> >>  		return -EINVAL;
> >>  	}
> >> --
> >> 2.11.0
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
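
For anyone who wants to exercise the new command from user space, a
minimal sketch of a caller follows. It is not part of the patch. The
command values are copied from the enum in the quoted patch; the helper
names membarrier_call() and private_expedited_barrier() are made up for
illustration. One assumption goes beyond this RFC: kernels that later
merged the feature expect a registration command (1 << 4) before the
first expedited call, so the sketch issues it only when
MEMBARRIER_CMD_QUERY reports that bit.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Values copied from enum membarrier_cmd in the quoted patch. */
#define CMD_QUERY                       0
#define CMD_PRIVATE_EXPEDITED           (1 << 3)
/*
 * Assumption: not defined by this RFC. Later mainline kernels use this
 * bit for a one-time registration step that must precede the command.
 */
#define CMD_REGISTER_PRIVATE_EXPEDITED  (1 << 4)

/* Thin wrapper; glibc offered no membarrier() wrapper at the time. */
static long membarrier_call(int cmd, int flags)
{
	return syscall(SYS_membarrier, cmd, flags);
}

/*
 * Issue a private expedited barrier for the calling process.
 * Returns 0 on success, -1 if the command is unsupported or fails.
 */
static int private_expedited_barrier(void)
{
	/* QUERY returns a bitmask of supported commands, or -1 on error. */
	long mask = membarrier_call(CMD_QUERY, 0);

	if (mask < 0 || !(mask & CMD_PRIVATE_EXPEDITED))
		return -1;

	/* Register only where the kernel advertises the registration bit. */
	if ((mask & CMD_REGISTER_PRIVATE_EXPEDITED) &&
	    membarrier_call(CMD_REGISTER_PRIVATE_EXPEDITED, 0) < 0)
		return -1;

	return membarrier_call(CMD_PRIVATE_EXPEDITED, 0) < 0 ? -1 : 0;
}
```

A caller would check the return value once at startup and, on -1, fall
back to whatever scheme it used before (for example
MEMBARRIER_CMD_SHARED or explicit fences on the fast path).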