Date: Sat, 28 Jan 2023 10:24:40 -0800
From: "Paul E. McKenney"
To: "Joel Fernandes (Google)"
Cc: linux-kernel@vger.kernel.org, Frederic Weisbecker, Mathieu Desnoyers,
	Boqun Feng, Josh Triplett, Lai Jiangshan, rcu@vger.kernel.org,
	Steven Rostedt
Subject: Re: [PATCH v4] srcu: Clarify comments on memory barrier "E"
Message-ID: <20230128182440.GA2948950@paulmck-ThinkPad-P17-Gen-1>
Reply-To: paulmck@kernel.org
References: <20230128035902.1758726-1-joel@joelfernandes.org>
In-Reply-To: <20230128035902.1758726-1-joel@joelfernandes.org>

On Sat, Jan 28, 2023 at 03:59:01AM +0000, Joel Fernandes (Google) wrote:
> During a flip, we have a full memory barrier before srcu_idx is incremented.
>
> The idea is we intend to order the first phase scan's read of lock
> counters with the flipping of the index.
>
> However, such ordering is already enforced because of the
> control-dependency between the 2 scans. We would be flipping the index
> only if lock and unlock counts matched.
>
> But such match will not happen if there was a pending reader before the flip
> in the first place (observation courtesy Mathieu Desnoyers).
>
> The litmus test below shows this:
> (test courtesy Frederic Weisbecker, Changes for ctrldep by Boqun/me):

Much better, thank you!  I of course did the usual wordsmithing, as
shown below.  Does this version capture your intent and understanding?
							Thanx, Paul

------------------------------------------------------------------------

commit 963f34624beb2af1ec08527e637d16ab6a1dacbd
Author: Joel Fernandes (Google)
Date:   Sat Jan 28 03:59:01 2023 +0000

    srcu: Clarify comments on memory barrier "E"

    There is an smp_mb() named "E" in srcu_flip() immediately before the
    increment (flip) of the srcu_struct structure's ->srcu_idx.  The
    purpose of E is to order the preceding scan's read of lock counters
    against the flipping of the ->srcu_idx, in order to prevent new
    readers from continuing to use the old ->srcu_idx value, which might
    needlessly extend the grace period.

    However, this ordering is already enforced because of the control
    dependency between the preceding scan and the ->srcu_idx flip.
    This control dependency exists because atomic_long_read() is used
    to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
    and because ->srcu_idx is not flipped until the ->srcu_lock_count[]
    and ->srcu_unlock_count[] counts match.  And such a match cannot
    happen when there is an in-flight reader that started before the
    flip (observation courtesy Mathieu Desnoyers).  The litmus test
    below (courtesy of Frederic Weisbecker, with changes for ctrldep
    by Boqun and Joel) shows this:

    C srcu

    (*
     * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count
     * inc, though P1 saw flip.
     *
     * So basically, the ->po ordering on both P0 and P1 is enforced via
     * ->ppo (control deps) on both sides, and both P0 and P1 are
     * interconnected by ->rf relations. Combining the ->ppo with ->rf,
     * a cycle is impossible.
     *)

    {}

    // updater
    P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
    	int lock1;
    	int unlock1;
    	int lock0;
    	int unlock0;

    	// SCAN1
    	unlock1 = READ_ONCE(*UNLOCK1);
    	smp_mb(); // A
    	lock1 = READ_ONCE(*LOCK1);

    	// FLIP
    	if (lock1 == unlock1) {	// Control dep
    		smp_mb(); // E	// Remove E and still passes.
    		WRITE_ONCE(*IDX, 1);
    		smp_mb(); // D

    		// SCAN2
    		unlock0 = READ_ONCE(*UNLOCK0);
    		smp_mb(); // A
    		lock0 = READ_ONCE(*LOCK0);
    	}
    }

    // reader
    P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
    	int tmp;
    	int idx1;
    	int idx2;

    	// 1st reader
    	idx1 = READ_ONCE(*IDX);
    	if (idx1 == 0) {	// Control dep
    		tmp = READ_ONCE(*LOCK0);
    		WRITE_ONCE(*LOCK0, tmp + 1);
    		smp_mb(); /* B and C */
    		tmp = READ_ONCE(*UNLOCK0);
    		WRITE_ONCE(*UNLOCK0, tmp + 1);
    	} else {
    		tmp = READ_ONCE(*LOCK1);
    		WRITE_ONCE(*LOCK1, tmp + 1);
    		smp_mb(); /* B and C */
    		tmp = READ_ONCE(*UNLOCK1);
    		WRITE_ONCE(*UNLOCK1, tmp + 1);
    	}
    }

    exists (0:lock1=1 /\ 1:idx1=1)

    More complicated litmus tests with multiple SRCU readers also show
    that memory barrier E is not needed.

    This commit therefore clarifies the comment on memory barrier E.

    Why not also remove that redundant smp_mb()?

    Because control dependencies are quite fragile due to their not
    being recognized by most compilers and tools.  Control dependencies
    therefore exact an ongoing maintenance burden, and such a burden
    cannot be justified in this slowpath.  Therefore, that smp_mb()
    stays until such time as its overhead becomes a measurable problem
    in a real workload running on a real production system, or until
    such time as compilers start paying attention to this sort of
    control dependency.

    Co-developed-by: Frederic Weisbecker
    Co-developed-by: Mathieu Desnoyers
    Co-developed-by: Boqun Feng
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney
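For reference, the litmus test above can be checked mechanically with
herd7 against the Linux-kernel memory model that ships in the kernel
source tree under tools/memory-model.  A minimal sketch, assuming the
test has been saved as srcu-flip.litmus (hypothetical file name) in
that directory:

	$ cd tools/memory-model
	$ herd7 -conf linux-kernel.cfg srcu-flip.litmus

If the analysis above holds, herd7 reports that the "exists" clause is
never satisfied, both with the test as written and, per the comment on
E in the test, with that smp_mb() removed.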
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index c541b82646b63..cd46fe063e50f 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1085,16 +1085,36 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 static void srcu_flip(struct srcu_struct *ssp)
 {
 	/*
-	 * Ensure that if this updater saw a given reader's increment
-	 * from __srcu_read_lock(), that reader was using an old value
-	 * of ->srcu_idx.  Also ensure that if a given reader sees the
-	 * new value of ->srcu_idx, this updater's earlier scans cannot
-	 * have seen that reader's increments (which is OK, because this
-	 * grace period need not wait on that reader).
+	 * Because the flip of ->srcu_idx is executed only if the
+	 * preceding call to srcu_readers_active_idx_check() found that
+	 * the ->srcu_unlock_count[] and ->srcu_lock_count[] sums matched
+	 * and because that summing uses atomic_long_read(), there is
+	 * ordering due to a control dependency between that summing and
+	 * the WRITE_ONCE() in this call to srcu_flip().  This ordering
+	 * ensures that if this updater saw a given reader's increment from
+	 * __srcu_read_lock(), that reader was using a value of ->srcu_idx
+	 * from before the previous call to srcu_flip(), which should be
+	 * quite rare.  This ordering thus helps forward progress because
+	 * the grace period could otherwise be delayed by additional
+	 * calls to __srcu_read_lock() using that old (soon to be new)
+	 * value of ->srcu_idx.
+	 *
+	 * This sum-equality check and ordering also ensures that if
+	 * a given call to __srcu_read_lock() uses the new value of
+	 * ->srcu_idx, this updater's earlier scans cannot have seen
+	 * that reader's increments, which is all to the good, because
+	 * this grace period need not wait on that reader.  After all,
+	 * if those earlier scans had seen that reader, there would have
+	 * been a sum mismatch and this code would not be reached.
+	 *
+	 * This means that the following smp_mb() is redundant, but
+	 * it stays until either (1) Compilers learn about this sort of
+	 * control dependency or (2) Some production workload running on
+	 * a production system is unduly delayed by this slowpath smp_mb().
 	 */
 	smp_mb(); /* E */  /* Pairs with B and C. */
 
-	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1);
+	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); // Flip the counter.
 
 	/*
 	 * Ensure that if the updater misses an __srcu_read_unlock()