Date: Thu, 13 Apr 2017 09:50:19 -0700
From: "Paul E. McKenney"
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com,
	dipankar@in.ibm.com, akpm@linux-foundation.org,
	mathieu.desnoyers@efficios.com, josh@joshtriplett.org,
	tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com,
	edumazet@google.com, fweisbec@gmail.com, oleg@redhat.com,
	bobby.prani@gmail.com
Subject: Re: [PATCH tip/core/rcu 40/40] srcu: Parallelize callback handling
Reply-To: paulmck@linux.vnet.ibm.com
Message-Id: <20170413165019.GH3956@linux.vnet.ibm.com>
In-Reply-To: <20170413095420.a2p2ygddz26gaugw@hirez.programming.kicks-ass.net>

On Thu, Apr 13, 2017 at 11:54:20AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 10:40:25AM -0700, Paul E. McKenney wrote:
> > Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1],
> > however, there are workloads that could result in a high volume of
> > concurrent invocations of call_srcu(), which with current SRCU would
> > result in excessive lock contention on the srcu_struct structure's
> > ->queue_lock, which protects SRCU's callback lists.  This commit therefore
> > moves SRCU to per-CPU callback lists, thus greatly reducing contention.
> >
> > Because a given SRCU instance no longer has a single centralized callback
> > list, starting grace periods and invoking callbacks each require a bit
> > more work.  These are handled using an srcu_node tree that is in some ways
> > similar to the rcu_node trees used by RCU-bh, RCU-preempt, and RCU-sched
> > (for example, the srcu_node tree shape is controlled by exactly the
> > same Kconfig options and boot parameters that control the shape of the
> > rcu_node tree).
> >
> > In addition, the old per-CPU srcu_array structure is now named srcu_data
> > and contains an rcu_segcblist structure named ->srcu_cblist for its
> > callbacks (and a spinlock to protect this).  The srcu_struct gets
> > an srcu_gp_seq that is used to associate callback segments with the
> > corresponding completion-time grace-period number.
> > These completion-time grace-period numbers are propagated up the
> > srcu_node tree so that the grace-period workqueue handler can determine
> > both whether additional grace periods are needed and where to look for
> > callbacks that are ready to be invoked.
> >
> > The srcu_barrier() function must now wait on all instances of the
> > per-CPU ->srcu_cblist.  Because each ->srcu_cblist is protected
> > by ->lock, srcu_barrier() can remotely add the needed callbacks.
> > In theory, it could also remotely start grace periods, but this gets
> > complex and racy.  And interestingly enough, it is never necessary to
> > start a grace period in this case because srcu_barrier() only enqueues
> > a callback when a callback is already present.  And a grace period has
> > to have already been started for this pre-existing callback.  And it is
> > only the callback that srcu_barrier() needs to wait on, not any particular
> > grace period.  Therefore, a new rcu_segcblist_entrain() function enqueues
> > the srcu_barrier() function's callback into the same segment occupied by
> > the pre-existing callback.  The special case where all the pre-existing
> > callbacks are on a different list being invoked is handled by enqueuing
> > srcu_barrier()'s callback into the RCU_DONE_TAIL segment, relying on
> > the done-callbacks check that takes place after all callbacks are invoked.
> >
> > Note that the readers use the same algorithm as before.  Note that there
> > is a separate srcu_idx that tells the readers what counter to increment.
> > This unfortunately cannot be combined with srcu_gp_seq because they
> > need to be incremented at different times.
>
> So one thing I've asked before I think, would it not be possible to
> abstract PREEMPT_RCU and use the exact same code for PREEMPT_RCU and
> SRCU?

I took a hard look at that some time ago, and it gets pretty ugly
pretty quickly.
Much of the PREEMPT_RCU code has the idea that there is only one
global PREEMPT_RCU implementation baked deeply into it.  For but one
example, the handling of an arbitrarily large number of ->blkd_tasks
lists at context-switch time would not be pretty, especially if the
task in question blocked while in both a PREEMPT_RCU and in an SRCU
read-side critical section.  Or, worse yet, if it blocked while in
several different SRCU read-side critical sections.

It might be easier to go the other way and implement PREEMPT_RCU in
terms of SRCU, but I don't believe that the read-side smp_mb() calls
would make people happy.  Plus there are use cases that would not be
well-served by idle no longer being an extended quiescent state.  And
SRCU currently has inconvenient restrictions about use in interrupt
and NMI handlers.

It might well be that there is a global solution for all this, but in
the meantime I am instead sharing common code and doing a bit of
consolidation.

							Thanx, Paul