Date: Fri, 18 Nov 2016 16:54:42 -0800
From: Lance Roy
To: "Paul E. McKenney"
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@efficios.com, josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com, oleg@redhat.com, bobby.prani@gmail.com
Subject: Re: [RFC PATCH] SRCU: More efficient reader counts.
Message-ID: <20161118165442.388adede@gmail.com>
In-Reply-To: <20161118231345.GA3612@linux.vnet.ibm.com>

On Fri, 18 Nov 2016 15:13:45 -0800 "Paul E. McKenney" wrote:
> On Fri, Nov 18, 2016 at 12:33:00PM -0800, Lance Roy wrote:
> > The trouble is that disabling preemption is not enough to ensure that
> > there is at most one srcu_read_lock() call per CPU that missed the
> > srcu_flip().
> >
> > Define a reader to be an SRCU lock+unlock pair. A reader is called
> > active if it has incremented ->lock_count[] but hasn't incremented
> > ->unlock_count[] yet, and completed if it has incremented
> > ->unlock_count[].
> > I think that we only want to limit the number of active readers and
> > the number of CPUs. In particular, I don't think there should be a
> > limit on the rate of completion of read side critical sections.
> >
> > The goal of srcu_readers_active_idx_check() is to verify that there
> > were zero active readers on the inactive index at some time during its
> > execution. To do this, it totals the unlock counts, executes a memory
> > barrier, totals the lock counts, and takes the difference. This
> > difference counts the readers that are active when
> > srcu_readers_lock_idx() gets to their CPU, plus the readers that
> > completed after srcu_readers_unlock_idx() and before
> > srcu_readers_lock_idx(). If the true (infinite precision) value of the
> > difference is zero, then there were no active readers at some point
> > while srcu_readers_lock_idx() was running. However, the difference is
> > only stored in a long, so there is a potential for overflow if too
> > many readers complete during srcu_readers_active_idx_check().
> >
> > For example, let's say there are three threads, each running on their
> > own CPU:
> >
> > int data, flag;
> > struct srcu_struct *sp = /* ... */;
> >
> > Thread 0:
> >     data = 1;
> >     synchronize_srcu(sp);
> >     flag = 1;
> >
> > Thread 1:
> >     int data_r, flag_r;
> >     int idx = srcu_read_lock(sp);
> >     data_r = data;
> >     flag_r = flag;
> >     srcu_read_unlock(sp, idx);
> >     BUG_ON(flag_r == 1 && data_r == 0);
> >
> > Thread 2:
> >     while (true) {
> >         int idx = srcu_read_lock(sp);
> >         srcu_read_unlock(sp, idx);
> >     }
> >
> > Let's take the following execution order. Thread 1 increments the
> > CPU 1 version of sp->lock_count[0], sets idx to zero, and loads data
> > (0) into data_r. Thread 0 then sets data to be 1, verifies that there
> > are no readers on index 1, and increments sp->completed, but the CPU
> > actually doesn't perform the last operation, putting it off until the
> > next memory barrier.
> > Thread 0 then calls srcu_readers_active_idx_check() on index 0,
> > which runs srcu_readers_unlock_idx() (returning 0). Right after
> > srcu_readers_unlock_idx() completes, thread 2 starts running. Since
> > thread 0 hasn't actually incremented sp->completed in any way that is
> > visible to thread 2, srcu_read_lock() will still return 0. Thread 2
> > can then run for ULONG_MAX iterations, setting the CPU 2 version of
> > sp->unlock_count[0] to ULONG_MAX. CPU 0 then finally gets around to
> > incrementing sp->completed, runs its memory barrier, and then reads
> > the lock counts: 1, 0, ULONG_MAX. The total of ULONG_MAX+1 will
> > overflow to 0 and compare equal with the earlier unlock count.
> > Thread 0 will then think that the grace period is over and set flag
> > to one. Thread 1 can then read flag (1) into flag_r and run
> > srcu_read_unlock(). The BUG_ON statement will then fail.
> >
> > Although ULONG_MAX readers completed during
> > srcu_readers_active_idx_check(), there were at most 2 active readers
> > at any time, so this program doesn't run into any limit.
> >
> > I hope that was clear enough.
>
> Indeed it is!
>
> So adding a full memory barrier immediately after the srcu_flip() should
> prevent this, because if the updater failed to see an unlock increment,
> the second following lock for that CPU/task would be guaranteed to see
> the flip. Or am I still missing something?

Yes, adding a full memory barrier after srcu_flip() prevents this problem.

> Is there a sequence of events that requires a full memory barrier
> before the srcu_flip()?

I am now unsure if a memory barrier before srcu_flip() is necessary. I thought that it would be needed to prevent the CPU from performing the increment early, but I have just noticed that srcu_advance_batches() will return early if the first try_check_zero() fails, creating a control dependency. I think that this control dependency should be enough to prevent the problem from occurring.
One interesting thing about this version is that there is only an address dependency between the load of ->completed and the increment of ->lock_count[]. This means that without an smp_read_barrier_depends() between the two, a reader could use the new value of ->completed to increment ->lock_count[] before actually reading ->completed. (I was surprised that this ordering doesn't come for free, as it seems like violating it would require speculative writes.) It doesn't matter much if the dependency barrier is actually there, because leaving it out just means at most 1 more spurious increment per CPU.

I think that with the added synchronization in srcu_flip() there would be at most 3 * NR_CPUS readers that start and complete inside one srcu_readers_active_idx_check(). With the read side dependency barrier this would drop to 2 * NR_CPUS.

Thanks,

Lance
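For readers following the thread, the two-counter scheme under discussion can be modeled in plain C roughly as follows. All names (`model_srcu`, `model_read_lock`, etc.) are invented for illustration; the comments mark where the smp_mb() calls discussed above would sit in the real kernel code. This is a single-threaded sketch of the accounting, not the patch itself:

```c
/* Userspace model of two-counter SRCU reader accounting.
 * Illustrative sketch only; not the kernel implementation. */
#include <assert.h>

#define NR_CPUS 4

struct model_srcu {
	unsigned long completed;                /* ->completed: selects index */
	unsigned long lock_count[2][NR_CPUS];   /* per-CPU ->lock_count[] */
	unsigned long unlock_count[2][NR_CPUS]; /* per-CPU ->unlock_count[] */
};

/* Reader side: a reader is "active" from this increment until the
 * matching unlock increment. */
static int model_read_lock(struct model_srcu *sp, int cpu)
{
	int idx = sp->completed & 1;

	sp->lock_count[idx][cpu]++;
	return idx;
}

static void model_read_unlock(struct model_srcu *sp, int idx, int cpu)
{
	sp->unlock_count[idx][cpu]++;
}

/* Updater side: flip the active index.  Per the discussion above, a
 * full memory barrier after the flip ensures that an updater that
 * missed an unlock increment still has the second following lock on
 * that CPU see the new index. */
static void model_flip(struct model_srcu *sp)
{
	sp->completed++;
	/* smp_mb() would go here in the kernel. */
}

/* Total unlocks, then total locks; the difference is the number of
 * active readers on this index (modulo the overflow concern above). */
static unsigned long model_active_readers(struct model_srcu *sp, int idx)
{
	unsigned long locks = 0, unlocks = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		unlocks += sp->unlock_count[idx][cpu];
	/* smp_mb() separates the two sums in the kernel. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		locks += sp->lock_count[idx][cpu];
	return locks - unlocks;
}
```

With no concurrency, locking on one CPU raises the active count for that index to 1 and unlocking drops it back to 0; the subtleties in the thread are entirely about when each increment and the flip become visible to the other CPUs.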