Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752947AbcKRXNx (ORCPT ); Fri, 18 Nov 2016 18:13:53 -0500 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:58245 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752389AbcKRXNv (ORCPT ); Fri, 18 Nov 2016 18:13:51 -0500 Date: Fri, 18 Nov 2016 15:13:45 -0800 From: "Paul E. McKenney" To: Lance Roy Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@efficios.com, josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com, oleg@redhat.com, bobby.prani@gmail.com Subject: Re: [RFC PATCH] SRCU: More efficient reader counts. Reply-To: paulmck@linux.vnet.ibm.com References: <20161117220335.20177-1-ldr709@gmail.com> <20161118140845.GQ3612@linux.vnet.ibm.com> <20161118123300.3eda4a78@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161118123300.3eda4a78@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16111823-0020-0000-0000-00000A4BB8BF X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00006101; HX=3.00000240; KW=3.00000007; PH=3.00000004; SC=3.00000189; SDB=6.00782596; UDB=6.00377659; IPR=6.00560065; BA=6.00004892; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00013373; XFM=3.00000011; UTC=2016-11-18 23:13:49 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 16111823-0021-0000-0000-0000576618C9 Message-Id: <20161118231345.GA3612@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-11-18_14:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1609300000 definitions=main-1611180394 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6182 Lines: 118 On Fri, Nov 18, 2016 at 12:33:00PM -0800, Lance Roy wrote: > On Fri, 18 Nov 2016 06:08:45 -0800 > "Paul E. McKenney" wrote: > > However, let's first take a look at the overflow issue. > > > > If a given program could have ULONG_MAX or more readers at any given > > time, there would of course be overflow. However, each read must have > > an srcu_read_lock() outstanding, and the resulting four-byte return > > value must be stored somewhere. Because the full address space is at > > most ULONG_MAX, the maximum number of outstanding readers is at most > > ULONG_MAX/4, even in the degenerate case where a single CPU/task invokes > > srcu_read_lock() in a tight loop. And even this assumes that the entire > > address space can somehow be devoted to srcu_read_lock() return values. > > ULONG_MAX/4 is therefore a hard upper bound on the number of outstanding > > SRCU readers. > I agree that there should be at most ULONG_MAX/4 active readers at a time, as > long as the compiler doesn't do something crazy like noticing that > srcu_read_lock() always returns 0 or 1 and then packing the return values into > smaller variables. > > > Now srcu_readers_active_idx_check() checks for strict equality between > > the number of locks and unlocks, so we can in theory tolerate ULONG_MAX-1 > > readers. So, the question is whether ULONG_MAX/4 readers can result > > in the updater seeing ULONG_MAX reads, due to memory misordering and > > other issues. > > > > Because there are no memory barriers surrounding srcu_flip(), the updater > > could miss an extremely large number of srcu_read_unlock()s. However, > > each missed srcu_read_unlock() must have a corresponding srcu_read_lock(), > > and there is a full memory barrier between between the srcu_flip() and > > the read of the lock count. There is also a full barrier between any > > srcu_read_lock()'s increment of the lock count and that CPU's/task's next > > srcu_read_lock()'s fetch of the index. Therefore, if the updater misses > > counting a given srcu_read_lock(), that CPU's/task's next srcu_read_lock() > > must see the new value of the index. Because srcu_read_lock() disables > > preemption across the index fetch and the lock increment, there can be at > > most NR_CPUS-1 srcu_read_lock() calls that missed the recent srcu_flip()'s > > update to the index. (I said NR_CPUS earlier, but Mathieu is correct > > in pointing out that srcu_flip() has to run somewhere.) > The trouble is that disabling preemption is not enough to ensure that there > is at most one srcu_read_lock() call per CPU that missed the srcu_flip(). > > Define a reader to be an SRCU lock+unlock pair. A reader is called active if it > has incremented ->lock_count[] but hasn't incremented ->unlock_count[] yet, and > completed if it has incremented ->unlock_count[]. I think that we only want to > limit the number of active readers and the number of CPUs. In particular, I > don't think there should be a limit on the rate of completion of read side > critical section. > > The goal of srcu_readers_active_idx_check() is to verify that there were zero > active readers on the inactive index at some time during its execution. To do > this, it totals the unlock counts, executes a memory barrier, totals the lock > counts, and takes the difference. This difference counts the readers that are > active when srcu_readers_lock_idx() gets to their CPU, plus the readers that > completed after srcu_readers_unlock_idx() and before srcu_readers_lock_idx(). > If the true (infinite precision) value of the difference is zero, then there > were no active readers at some point while srcu_readers_lock_idx() is running. > However, the difference is only stored in a long, so there is a potential for > overflow if too many readers complete during srcu_readers_active_idx_check(). > > For example, let's say there are three threads, each running on their own CPU: > > int data, flag; > struct srcu_struct *sp = /* ... */; > > Thread 0: > data = 1; > synchronize_srcu(sp); > flag = 1; > > Thread 1: > int data_r, flag_r; > int idx = srcu_read_lock(sp); > data_r = data; > flag_r = flag; > srcu_read_unlock(sp, idx); > BUG_ON(flag_r == 1 && data_r == 0); > > Thread 2: > while (true) { > int idx = srcu_read_lock(sp); > srcu_read_unlock(sp, idx); > } > > Let's take the following execution order. Thread 1 increments > the CPU 1 version of sp->lock_count[0], sets idx to zero, and loads data (0) > into data_r. Thread 0 then sets data to be 1, verifies that there are no > readers on index 1, and increments sp->completed, but the CPU actually doesn't > preform the last operation, putting it off until the next memory barrier. Thread > 0 then calls srcu_readers_active_idx_check() on index 0, which runs > srcu_readers_unlock_idx() (returning 0). Right after srcu_readers_unlock_idx() > completes, thread 2 starts running. Since Thread 0 hasn't actually incremented > sp->completed in any way that is visible to thread 2, srcu_read_lock() will > still return 0. Thread 2 can then run for ULONG_MAX iterations, setting > the CPU 2 version of sp->unlock_count[0] to ULONG_MAX. CPU 0 then finally gets > around to incrementing sp->completed, runs its memory barrier, and then reads > the lock counts: 1, 0, ULONG_MAX. The total of ULONG_MAX+1 will overflow to 0 > and compare equal with earlier unlock count. Thread 0 will then think that the > grace period is over and set flag to one. Thread 1 can then read flag (1) into > flag_r and run srcu_read_unlock(). The BUG_ON statement will then fail. > > Although ULONG_MAX readers completed during srcu_readers_active_idx_check(), > there were at most 2 active readers at any time, so this program doesn't run > into any limit. > > I hope that was clear enough. Indeed it is! So adding a full memory barrier immediately after the srcu_flip() should prevent this, because if the updater failed to see an unlock increment, the second following lock for that CPU/task would be guaranteed to see the flip. Or am I still missing something? Is there a sequence of events that requires a full memory barrier before the srcu_flip()? Thanx, Paul