Date: Thu, 27 Jul 2017 20:58:38 +0000 (UTC)
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Avi Kivity <avi@scylladb.com>, maged michael <maged.michael@gmail.com>,
        Andrew Hunter <ahh@google.com>, gromer@google.com,
        linux-kernel <linux-kernel@vger.kernel.org>
Message-ID: <1035920775.29109.1501189118277.JavaMail.zimbra@efficios.com>
In-Reply-To: <20170727203706.GO3730@linux.vnet.ibm.com>
References: <20170727181250.GA20183@linux.vnet.ibm.com> <5c8c6946-ce3a-6183-76a2-027823a9948a@scylladb.com> <20170727194322.GL3730@linux.vnet.ibm.com> <5fe39d32-5fc1-3a59-23fc-9bdb1d90edf9@scylladb.com> <20170727203706.GO3730@linux.vnet.ibm.com>
Subject: Re: Updated sys_membarrier() speedup patch, FYI

----- On Jul 27, 2017, at 4:37 PM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

> On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:
>> On 07/27/2017 10:43 PM, Paul E. McKenney wrote:
>> >On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:
>> >>On 07/27/2017 09:12 PM, Paul E. McKenney wrote:
>> >>>Hello!
>> >>>
>> >>>Please see below for a prototype sys_membarrier() speedup patch.
>> >>>Please note that there is some controversy on this subject, so the final
>> >>>version will probably be quite a bit different than this prototype.
>> >>>
>> >>>But my main question is whether the throttling shown below is acceptable
>> >>>for your use cases, namely only one expedited sys_membarrier() permitted
>> >>>per scheduling-clock period (1 millisecond on many platforms), with any
>> >>>excess being silently converted to non-expedited form.  The reason for
>> >>>the throttling is concerns about DoS attacks based on user code with a
>> >>>tight loop invoking this system call.
>> >>>
>> >>>Thoughts?
>> >>Silent throttling would render it useless for me. -EAGAIN is a
>> >>little better, but I'd be forced to spin until either I get kicked
>> >>out of my loop, or it succeeds.
>> >>
>> >>IPIing only running threads of my process would be perfect. In fact
>> >>I might even be able to make use of "membarrier these threads
>> >>please" to reduce IPIs, when I change the topology from fully
>> >>connected to something more sparse, on larger machines.
>> >>
>> >>My previous implementations were a signal (but that's horrible on
>> >>large machines) and trylock + mprotect (but that doesn't work on
>> >>ARM).
>> >OK, how about the following patch, which IPIs only the running
>> >threads of the process doing the sys_membarrier()?
>> 
>> Works for me.
> 
> Thank you for testing!  I expect that Mathieu will have a v2 soon,
> hopefully CCing you guys.  (If not, I will forward it.)
> 

Will do!

> Mathieu, please note Avi's feedback below.

More below.

> 
>							Thanx, Paul
> 
>> >------------------------------------------------------------------------
>> >
>> >From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> >To: Peter Zijlstra <peterz@infradead.org>
>> >Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers
>> >  <mathieu.desnoyers@efficios.com>,
>> >  "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>, Boqun Feng
>> >  <boqun.feng@gmail.com>
>> >Subject: [RFC PATCH] membarrier: expedited private command
>> >Date: Thu, 27 Jul 2017 14:59:43 -0400
>> >Message-Id: <20170727185943.11570-1-mathieu.desnoyers@efficios.com>
>> >
>> >Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
>> >from all runqueues for which current thread's mm is the same as our own.
>> >
>> >Scheduler-wise, it requires that we add a memory barrier after context
>> >switching between processes (which have different mm).
>> >
>> >It would be interesting to benchmark the overhead of this added barrier
>> >on the performance of context switching between processes. If the
>> >preexisting overhead of switching between mm is high enough, the
>> >overhead of adding this extra barrier may be insignificant.
>> >
>> >[ Compile-tested only! ]
>> >
>> >CC: Peter Zijlstra <peterz@infradead.org>
>> >CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >CC: Boqun Feng <boqun.feng@gmail.com>
>> >Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> >---
>> >  include/uapi/linux/membarrier.h |  8 +++--
>> >  kernel/membarrier.c             | 76 ++++++++++++++++++++++++++++++++++++++++-
>> >  kernel/sched/core.c             | 21 ++++++++++++
>> >  3 files changed, 102 insertions(+), 3 deletions(-)
>> >
>> >diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>> >index e0b108bd2624..6a33c5852f6b 100644
>> >--- a/include/uapi/linux/membarrier.h
>> >+++ b/include/uapi/linux/membarrier.h
>> >@@ -40,14 +40,18 @@
>> >   *                          (non-running threads are de facto in such a
>> >   *                          state). This covers threads from all processes
>> >   *                          running on the system. This command returns 0.
>> >+ * TODO: documentation.
>> >   *
>> >   * Command to be passed to the membarrier system call. The commands need to
>> >   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>> >   * the value 0.
>> >   */
>> >  enum membarrier_cmd {
>> >-	MEMBARRIER_CMD_QUERY = 0,
>> >-	MEMBARRIER_CMD_SHARED = (1 << 0),
>> >+	MEMBARRIER_CMD_QUERY			= 0,
>> >+	MEMBARRIER_CMD_SHARED			= (1 << 0),
>> >+	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
>> >+	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
>> >+	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>> >  };
>> >
>> >  #endif /* _UAPI_LINUX_MEMBARRIER_H */
>> >diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>> >index 9f9284f37f8d..8c6c0f96f617 100644
>> >--- a/kernel/membarrier.c
>> >+++ b/kernel/membarrier.c
>> >@@ -19,10 +19,81 @@
>> >  #include <linux/tick.h>
>> >
>> >  /*
>> >+ * XXX For cpu_rq(). Should we rather move
>> >+ * membarrier_private_expedited() to sched/core.c or create
>> >+ * sched/membarrier.c ?
>> >+ */
>> >+#include "sched/sched.h"
>> >+
>> >+/*
>> >   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>> >   * except MEMBARRIER_CMD_QUERY.
>> >   */
>> >-#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
>> >+#define MEMBARRIER_CMD_BITMASK	\
>> >+	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
>> >+
>> 
>> >	rcu_read_unlock();
>> >+	}
>> >+}
>> >+
>> >+static void membarrier_private_expedited(void)
>> >+{
>> >+	int cpu, this_cpu;
>> >+	cpumask_var_t tmpmask;
>> >+
>> >+	if (num_online_cpus() == 1)
>> >+		return;
>> >+
>> >+	/*
>> >+	 * Matches memory barriers around rq->curr modification in
>> >+	 * scheduler.
>> >+	 */
>> >+	smp_mb();	/* system call entry is not a mb. */
>> >+
>> >+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> >+		/* Fallback for OOM. */
>> >+		membarrier_private_expedited_ipi_each();
>> >+		goto end;
>> >+	}
>> >+
>> >+	this_cpu = raw_smp_processor_id();
>> >+	for_each_online_cpu(cpu) {
>> >+		struct task_struct *p;
>> >+
>> >+		if (cpu == this_cpu)
>> >+			continue;
>> >+		rcu_read_lock();
>> >+		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> >+		if (p && p->mm == current->mm)
>> >+			__cpumask_set_cpu(cpu, tmpmask);
>> 
>> This gets you some false positives, if the CPU idled then mm will
>> not have changed.
> 
> Good point!  The battery-powered embedded guys would probably prefer
> we not needlessly IPI idle CPUs.  We cannot rely on RCU's dyntick-idle
> state in nohz_full cases.  Not sure if is_idle_task() can be used
> safely, given things like play_idle().

Would changing the check in this loop to:

if (p && !is_idle_task(p) && p->mm == current->mm) {

work for you?
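
To make the effect of that extra test concrete, here is a rough userspace
model of the CPU-selection loop. It is only a sketch: struct mm, struct
task, the is_idle flag (standing in for is_idle_task()), NR_CPUS and
select_ipi_targets() are all made up for illustration and are not the
kernel definitions. It also assumes the scenario raised in the review,
where an idled CPU still appears to carry the caller's mm; the sketch
does not establish whether the real kernel ever exposes that via p->mm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures, for illustration only. */
struct mm {
	int id;
};

struct task {
	struct mm *mm;	/* address space seen through this task */
	bool is_idle;	/* models is_idle_task(p) */
};

#define NR_CPUS 4	/* arbitrary size for the model */

/*
 * Return a bitmask of the CPUs that would be sent an IPI: every CPU
 * other than the caller's whose current task is not idle and uses the
 * caller's mm.
 */
static unsigned int select_ipi_targets(struct task *curr[NR_CPUS],
				       int this_cpu,
				       const struct mm *caller_mm)
{
	unsigned int mask = 0;
	int cpu;

	assert(this_cpu >= 0 && this_cpu < NR_CPUS);

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		struct task *p = curr[cpu];

		if (cpu == this_cpu)
			continue;	/* the caller needs no IPI */
		if (p && !p->is_idle && p->mm == caller_mm)
			mask |= 1u << cpu;
	}
	return mask;
}
```

With CPU 0 as the caller, CPU 1 running another thread of the same
process, CPU 2 idle but still associated with the caller's mm, and CPU 3
running an unrelated process, only CPU 1 ends up in the mask. Dropping
the is_idle test from the model would also select CPU 2.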

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com