Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751659AbdHABdS (ORCPT ); Mon, 31 Jul 2017 21:33:18 -0400 Received: from mail.efficios.com ([167.114.142.141]:48177 "EHLO mail.efficios.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751199AbdHABdR (ORCPT ); Mon, 31 Jul 2017 21:33:17 -0400 Date: Tue, 1 Aug 2017 01:33:09 +0000 (UTC) From: Mathieu Desnoyers To: Nicholas Piggin Cc: Michael Ellerman , Peter Zijlstra , "Paul E. McKenney" , linux-kernel , Boqun Feng , Andrew Hunter , maged michael , gromer , Avi Kivity , Benjamin Herrenschmidt , Palmer Dabbelt , Dave Watson Message-ID: <973223324.694.1501551189603.JavaMail.zimbra@efficios.com> In-Reply-To: <20170731233731.32e68f6d@roar.ozlabs.ibm.com> References: <20170727211314.32666-1-mathieu.desnoyers@efficios.com> <20170728085532.ylhuz2irwmgpmejv@hirez.programming.kicks-ass.net> <20170728115702.5vgnvwhmbbmyrxbf@hirez.programming.kicks-ass.net> <87tw1s4u9w.fsf@concordia.ellerman.id.au> <20170731233731.32e68f6d@roar.ozlabs.ibm.com> Subject: Re: [RFC PATCH v2] membarrier: expedited private command MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [167.114.142.141] X-Mailer: Zimbra 8.7.9_GA_1794 (ZimbraWebClient - FF52 (Linux)/8.7.9_GA_1794) Thread-Topic: membarrier: expedited private command Thread-Index: X02TjWJ2Wq2hcpHm4udbd0k4V6wQpQ== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5581 Lines: 155 ----- On Jul 31, 2017, at 8:35 PM, Nicholas Piggin npiggin@gmail.com wrote: > On Mon, 31 Jul 2017 23:20:59 +1000 > Michael Ellerman wrote: > >> Peter Zijlstra writes: >> >> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote: >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> >> index e9785f7aed75..33f34a201255 100644 >> >> --- a/kernel/sched/core.c >> >> +++ b/kernel/sched/core.c >> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct >> >> *prev) >> >> finish_arch_post_lock_switch(); >> >> >> >> fire_sched_in_preempt_notifiers(current); >> >> + >> >> + /* >> >> + * For CONFIG_MEMBARRIER we need a full memory barrier after the >> >> + * rq->curr assignment. Not all architectures have one in either >> >> + * switch_to() or switch_mm() so we use (and complement) the one >> >> + * implied by mmdrop()'s atomic_dec_and_test(). >> >> + */ >> >> if (mm) >> >> mmdrop(mm); >> >> + else if (IS_ENABLED(CONFIG_MEMBARRIER)) >> >> + smp_mb(); >> >> + >> >> if (unlikely(prev_state == TASK_DEAD)) { >> >> if (prev->sched_class->task_dead) >> >> prev->sched_class->task_dead(prev); >> >> >> >> >> > >> >> a whole bunch of architectures don't in fact need this extra barrier at all. >> > >> > In fact, I'm fairly sure its only PPC. >> > >> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with >> > anything other than smp_mb() (for now, Risc-V is in this same boat and >> > MIPS could be if they ever sort out their fancy barriers). >> > >> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a >> > smp_mb() and there are enough around to make one happen (typically >> > mm_cpumask updates). >> > >> > Everybody else, aside from ARM64 and PPC must use smp_mb() for >> > ACQUIRE/RELEASE. >> > >> > ARM64 has a super duper barrier in switch_to(). >> > >> > Which only leaves PPC stranded.. but the 'good' news is that mpe says >> > they'll probably need a barrier in switch_mm() in any case. >> >> I may have been sleep deprived. We have a patch, probably soon to be >> merged, which will add a smp_mb() in switch_mm() but *only* when we add >> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before. >> >> I'm not across membarrier enough to know if that's sufficient, but it >> seems unlikely? > > Won't be sufficient, they need a barrier after assigning rq->curr. > It can be avoided when switching between threads with the same mm. > > I would like to see how bad membarrier performance is if we made > that side heavy enough to avoid the barrier in context switch (e.g., > by taking the rq locks, or using synchronize_sched_expedited -- on > a per-arch basis of course). > > Is there some (realistic-ish) benchmark using membarrier we can > experiment with? Hi Nick, I have benchmark programs using membarrier in liburcu [1]. They represent very heavy usage of membarrier (perhaps more extreme than reality). The master branch is at: git://git.liburcu.org/userspace-rcu.git Below is a test branch I quickly hacked to make urcu use membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED: https://github.com/compudj/userspace-rcu-dev/tree/test-mb-private An example benchmark is under tests/benchmark Usage: test_urcu nr_readers nr_writers duration (s) Here is an example on my 16-core Intel machine (with hyperthreading, so don't expect this benchmark to be very reliable) with 4 reader threads, 1 writer thread. The writer thread in this benchmark issues a synchronize_rcu() at each iteration, so typically 2 sys_membarrier calls. I'm compiling with the configure flag "--disable-sys-membarrier-fallback" to have the full read-side speed. Using MEMBARRIER_CMD_SHARED: $ ./test_urcu 4 1 10 SUMMARY [...] test_urcu testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 17522991852 nr_writes 249 nr_ops 17522992101 438'074'796 reads / reader thread / s 25 updates / s Using MEMBARRIER_CMD_PRIVATE_EXPEDITED: $ ./test_urcu 4 1 10 SUMMARY [...] testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 12709653290 nr_writes 799668 nr_ops 12710452958 317'741'332 reads / reader thread / s 79'966 updates / s As a comparison, refer to the test_urcu_mb benchmark, which uses a full memory barrier in rcu_read_lock and rcu_read_unlock, and can therefore do without any sys_membarrier. This is our fallback when sys_membarrier returns -ENOSYS. $ ./test_urcu_mb 4 1 10 SUMMARY [...] test_urcu_mb testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 137290634 nr_writes 8624184 nr_ops 145914818 3'432'265 reads / reader thread / s 862'418 updates / s You could try adapting my membarrier private-expedited patch to grab the rq lock for each CPU to see the impact on the benchmark for your specific architecture. I suspect that the overhead that will matter in that case is not only the speed of the membarrier operation, but also its disruption of the scheduler rq lock cachelines, which will have effect on other unrelated workloads. You may need to benchmark another scheduler-intensive workload executed in parallel with heavy use of sys_membarrier adapted to grab rq locks to figure this one out. Let me know if you have questions, Perhaps Dave Watson will have benchmarks on hazard pointers. Thanks, Mathieu [1] http://liburcu.org > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com