Date: Tue, 1 Aug 2017 12:00:47 +1000
From: Nicholas Piggin
To: Mathieu Desnoyers
Cc: Michael Ellerman, Peter Zijlstra, "Paul E. McKenney", linux-kernel,
 Boqun Feng, Andrew Hunter, maged michael, gromer, Avi Kivity,
 Benjamin Herrenschmidt, Palmer Dabbelt, Dave Watson
Subject: Re: [RFC PATCH v2] membarrier: expedited private command
Message-ID: <20170801120047.61c59064@roar.ozlabs.ibm.com>
In-Reply-To: <973223324.694.1501551189603.JavaMail.zimbra@efficios.com>
Organization: IBM
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 1 Aug 2017 01:33:09 +0000 (UTC)
Mathieu Desnoyers wrote:

> ----- On Jul 31, 2017, at 8:35 PM, Nicholas Piggin npiggin@gmail.com wrote:
> 
> > On Mon, 31 Jul 2017 23:20:59 +1000
> > Michael Ellerman wrote:
> > 
> >> Peter Zijlstra writes:
> >> 
> >> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
> >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> >> index e9785f7aed75..33f34a201255 100644
> >> >> --- a/kernel/sched/core.c
> >> >> +++ b/kernel/sched/core.c
> >> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >> >>  	finish_arch_post_lock_switch();
> >> >>  
> >> >>  	fire_sched_in_preempt_notifiers(current);
> >> >> +
> >> >> +	/*
> >> >> +	 * For CONFIG_MEMBARRIER we need a full memory barrier after the
> >> >> +	 * rq->curr assignment. Not all architectures have one in either
> >> >> +	 * switch_to() or switch_mm() so we use (and complement) the one
> >> >> +	 * implied by mmdrop()'s atomic_dec_and_test().
> >> >> +	 */
> >> >>  	if (mm)
> >> >>  		mmdrop(mm);
> >> >> +	else if (IS_ENABLED(CONFIG_MEMBARRIER))
> >> >> +		smp_mb();
> >> >> +
> >> >>  	if (unlikely(prev_state == TASK_DEAD)) {
> >> >>  		if (prev->sched_class->task_dead)
> >> >>  			prev->sched_class->task_dead(prev);
> >> >> 
> >> >> 
> >> > 
> >> >> a whole bunch of architectures don't in fact need this extra barrier at all.
> >> > 
> >> > In fact, I'm fairly sure it's only PPC.
> >> > 
> >> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> >> > anything other than smp_mb() (for now, RISC-V is in this same boat and
> >> > MIPS could be if they ever sort out their fancy barriers).
> >> > 
> >> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
> >> > smp_mb() and there are enough around to make one happen (typically
> >> > mm_cpumask updates).
> >> > 
> >> > Everybody else, aside from ARM64 and PPC, must use smp_mb() for
> >> > ACQUIRE/RELEASE.
> >> > 
> >> > ARM64 has a super duper barrier in switch_to().
> >> > 
> >> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
> >> > they'll probably need a barrier in switch_mm() in any case.
> >> 
> >> I may have been sleep deprived. We have a patch, probably soon to be
> >> merged, which will add a smp_mb() in switch_mm() but *only* when we add
> >> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
> >> 
> >> I'm not across membarrier enough to know if that's sufficient, but it
> >> seems unlikely?
> > 
> > Won't be sufficient, they need a barrier after assigning rq->curr.
> > It can be avoided when switching between threads with the same mm.
> > 
> > I would like to see how bad membarrier performance is if we made
> > that side heavy enough to avoid the barrier in context switch (e.g.,
> > by taking the rq locks, or using synchronize_sched_expedited -- on
> > a per-arch basis of course).
> > 
> > Is there some (realistic-ish) benchmark using membarrier we can
> > experiment with?
> 
> Hi Nick,
> 
> I have benchmark programs using membarrier in liburcu [1]. They
> represent very heavy usage of membarrier (perhaps more extreme than
> reality).
> 
> The master branch is at:
> 
> git://git.liburcu.org/userspace-rcu.git
> 
> Below is a test branch I quickly hacked to make urcu use membarrier
> MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> 
> https://github.com/compudj/userspace-rcu-dev/tree/test-mb-private
> 
> An example benchmark is under tests/benchmark.
> 
> Usage: test_urcu nr_readers nr_writers duration (s)
> 
> Here is an example on my 16-core Intel machine (with hyperthreading, so
> don't expect this benchmark to be very reliable) with 4 reader threads
> and 1 writer thread. The writer thread in this benchmark issues a
> synchronize_rcu() at each iteration, so typically 2 sys_membarrier
> calls. I'm compiling with the configure flag
> "--disable-sys-membarrier-fallback" to get the full read-side speed.
> 
> Using MEMBARRIER_CMD_SHARED:
> 
> $ ./test_urcu 4 1 10
> SUMMARY [...] test_urcu testdur 10 nr_readers 4 rdur 0 wdur 0
>   nr_writers 1 wdelay 0 nr_reads 17522991852 nr_writes 249
>   nr_ops 17522992101
> 
> 438'074'796 reads / reader thread / s
> 25 updates / s
> 
> Using MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> 
> $ ./test_urcu 4 1 10
> SUMMARY [...] testdur 10 nr_readers 4 rdur 0 wdur 0
>   nr_writers 1 wdelay 0 nr_reads 12709653290 nr_writes 799668
>   nr_ops 12710452958
> 
> 317'741'332 reads / reader thread / s
> 79'966 updates / s
> 
> As a comparison, refer to the test_urcu_mb benchmark, which uses
> a full memory barrier in rcu_read_lock and rcu_read_unlock, and
> can therefore do without any sys_membarrier. This is our fallback
> when sys_membarrier returns -ENOSYS.
> 
> $ ./test_urcu_mb 4 1 10
> SUMMARY [...] test_urcu_mb testdur 10 nr_readers 4 rdur 0 wdur 0
>   nr_writers 1 wdelay 0 nr_reads 137290634 nr_writes 8624184
>   nr_ops 145914818
> 
> 3'432'265 reads / reader thread / s
> 862'418 updates / s
> 
> You could try adapting my membarrier private-expedited patch to grab the
> rq lock for each CPU to see the impact on the benchmark for your specific
> architecture. I suspect that the overhead that will matter in that case
> is not only the speed of the membarrier operation itself, but also its
> disruption of the scheduler rq lock cachelines, which will affect other
> unrelated workloads. You may need to benchmark another scheduler-intensive
> workload, run in parallel with heavy use of a sys_membarrier adapted to
> grab rq locks, to figure this one out.
> 
> Let me know if you have questions,

Thanks for this, I'll take a look. This should be a good start as a
stress test, but I'd also be interested in a more realistic application.
The reason is that taking runqueue locks, for example, may give
reasonable maximum-throughput numbers, but could cause some latency or
slowdown when used in a more realistic scenario.

We don't have numbers for an smp_mb() in switch_mm() yet. As you say,
that should have lower overhead, but it is probably still undesirable.
That's one of the things I'm looking at; I will report back after I've
tried a few more things.

Thanks,
Nick