Date: Thu, 29 Jun 2017 11:11:26 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Alan Stern
Cc: Will Deacon, Linus Torvalds, Andrea Parri,
	Linux Kernel Mailing List, priyalee.kushwaha@intel.com,
	Stanisław Drozd, Arnd Bergmann, ldr709@gmail.com,
	Thomas Gleixner, Peter Zijlstra, Josh Triplett,
	Nicolas Pitre, Krister Johansen, Vegard Nossum,
	dcb314@hotmail.com, Wu Fengguang, Frederic Weisbecker,
	Rik van Riel, Steven Rostedt, Ingo Molnar,
	Luc Maranget, Jade Alglave
Subject: Re: [GIT PULL rcu/next] RCU commits for 4.13
Reply-To: paulmck@linux.vnet.ibm.com
Message-Id: <20170629181126.GA2393@linux.vnet.ibm.com>
References: <20170629113848.GA18630@arm.com>
List-ID: <linux-kernel.vger.kernel.org>
On Thu, Jun 29, 2017 at 11:59:27AM -0400, Alan Stern wrote:
> On Thu, 29 Jun 2017, Will Deacon wrote:
> 
> > [turns out I've not been on cc for this thread, but Jade pointed me to it
> > and I see my name came up at some point!]
> > 
> > On Wed, Jun 28, 2017 at 05:05:46PM -0700, Linus Torvalds wrote:
> > > On Wed, Jun 28, 2017 at 4:54 PM, Paul E. McKenney wrote:
> > > >
> > > > Linus, are you dead-set against defining spin_unlock_wait() to be
> > > > spin_lock + spin_unlock?  For example, is the current x86 implementation
> > > > of spin_unlock_wait() really a non-negotiable hard requirement?  Or
> > > > would you be willing to live with the spin_lock + spin_unlock semantics?
> > > 
> > > So I think the "same as spin_lock + spin_unlock" semantics are kind of insane.
> > > 
> > > One of the issues is that the "same as spin_lock + spin_unlock" is
> > > basically now architecture-dependent.  Is it really the
> > > architecture-dependent ordering you want to define this as?
> > > 
> > > So I just think it's a *bad* definition.  If somebody wants something
> > > that is exactly equivalent to spin_lock+spin_unlock, then dammit, just
> > > do *THAT*.  It's completely pointless to me to define
> > > spin_unlock_wait() in those terms.
> > > 
> > > And if it's not equivalent to the *architecture* behavior of
> > > spin_lock+spin_unlock, then I think it should be described in terms
> > > that aren't about the architecture implementation (so you shouldn't
> > > describe it as "spin_lock+spin_unlock", you should describe it in
> > > terms of memory barrier semantics).
> > > 
> > > And if we really have to use the spin_lock+spin_unlock semantics for
> > > this, then what is the advantage of spin_unlock_wait at all, if it
> > > doesn't fundamentally avoid some locking overhead of just taking the
> > > spinlock in the first place?
> > 
> > Just on this point -- the arm64 code provides the same ordering semantics
> > as you would get from a lock;unlock sequence, but we can optimise that
> > when compared to an actual lock;unlock sequence because we don't need to
> > wait in turn for our ticket.  I suspect something similar could be done
> > if/when we move to qspinlocks.
> > 
> > Whether or not this is actually worth optimising is another question, but
> > it is worth noting that unlock_wait can be implemented more cheaply than
> > lock;unlock, whilst providing the same ordering guarantees (if that's
> > really what we want -- see my reply to Paul).
> > 
> > Simplicity tends to be my preference, so ripping this out would suit me
> > best ;)
> 
> It would be best to know:
> 
> 	(1). How spin_unlock_wait() is currently being used.
> 
> 	(2). What it was originally intended for.
> 
> Paul has done some research into (1).  He can correct me if I get this
> wrong...  Only a few (i.e., around one or two) of the usages don't seem
> to require the full spin_lock+spin_unlock semantics.  I go along with
> Linus; the places which really do want it to behave like
> spin_lock+spin_unlock should simply use spin_lock+spin_unlock.  There
> hasn't been any indication so far that the possible efficiency
> improvement Will mentions is at all important.
> 
> According to Paul, most of the other places don't need anything more
> than the acquire guarantee (any changes made in earlier critical
> sections will be visible to the code following spin_unlock_wait).  In
> which case, the semantics of spin_unlock_wait could be redefined in
> this simpler form.
> 
> Or we could literally replace all the existing definitions with
> spin_lock+spin_unlock.  Would that be so terrible?
And here they are...

	spin_unlock_wait():

o	drivers/ata/libata-eh.c ata_scsi_cmd_error_handler()

	spin_unlock_wait(ap->lock) in else-clause where then-clause has
	a full critical section for this same lock.  This use case could
	potentially require both acquire and release semantics.  (I am
	following up with the developers/maintainers, suggesting that
	they convert to spin_lock+spin_unlock if they need release
	semantics.)

	This is error-handling code, which should be rare, so
	spin_lock+spin_unlock should work fine here.  Probably shouldn't
	have bugged the maintainer, but email already sent.  :-/

o	ipc/sem.c exit_sem()

	This use case appears to need to wait only on prior critical
	sections, as the only way we get here is if the entry has
	already been removed from the list.  An acquire-only
	spin_unlock_wait() works here.  However, this is sem-exit code,
	which is not a fastpath, and the race should be rare, so
	spin_lock+spin_unlock should work fine here.

o	kernel/sched/completion.c completion_done()

	This use case appears to need to wait only on prior critical
	sections, as the only way we get past the "if" is when the lock
	is held by complete(), and you are only supposed to invoke
	complete() once on a given completion.  An acquire-only
	spin_unlock_wait() works here, but the race should be rare, so
	spin_lock+spin_unlock should also work fine here.

o	net/netfilter/nf_conntrack_core.c nf_conntrack_lock()

	This instance of spin_unlock_wait() interacts with
	nf_conntrack_all_lock()'s instance of spin_unlock_wait().
	Although nf_conntrack_all_lock() has an smp_mb(), which I
	believe provides release semantics given current
	implementations, nf_conntrack_lock() just has smp_rmb().  I
	believe that the smp_rmb() needs to be smp_mb().  Am I missing
	something here that makes the current code safe on x86?

	I believe that this code could use spin_lock+spin_unlock without
	significant performance penalties -- I do not believe that
	nf_conntrack_locks_all_lock gets significant contention.
	raw_spin_unlock_wait() (Courtesy of Andrea Parri with added
	commentary):

o	kernel/exit.c do_exit()

	Seems to rely on both acquire and release semantics.  The
	raw_spin_unlock_wait() primitive is preceded by a smp_mb().
	But this is task exit doing spin_unlock_wait() on the task's
	lock, so spin_lock+spin_unlock should work fine here.

o	kernel/sched/core.c do_task_dead()

	Seems to rely on the acquire semantics only.  The
	raw_spin_unlock_wait() primitive is preceded by an inexplicable
	smp_mb().  Again, this is task exit doing spin_unlock_wait() on
	the task's lock, so spin_lock+spin_unlock should work fine here.

o	kernel/task_work.c task_work_run()

	Seems to rely on the acquire semantics only.  This is to handle
	a race with task_work_cancel(), which appears to be quite rare.
	So the spin_lock+spin_unlock should work fine here.

	spin_lock()/spin_unlock():

o	ipc/sem.c complexmode_enter()

	This used to be spin_unlock_wait(), but was changed to a
	spin_lock()/spin_unlock() pair by 27d7be1801a4 ("ipc/sem.c:
	avoid using spin_unlock_wait()").

Looks to me like we really can drop spin_unlock_wait() in favor of
momentarily acquiring the lock.  There are so few use cases that I
don't see a problem open-coding this.  I will put together yet another
patch series for my spin_unlock_wait() collection of patch serieses.  ;-)

> As regards (2), I did a little digging.  spin_unlock_wait was
> introduced in the 2.1.36 kernel, in mid-April 1997.  I wasn't able to
> find a specific patch for it in the LKML archives.  At the time it
> was used in only one place in the entire kernel (in kernel/exit.c):
> 
> void release(struct task_struct * p)
> {
> 	int i;
> 
> 	if (!p)
> 		return;
> 	if (p == current) {
> 		printk("task releasing itself\n");
> 		return;
> 	}
> 	for (i=1 ; i<NR_TASKS ; i++)
> 		if (task[i] == p) {
> #ifdef __SMP__
> 			/* FIXME! Cheesy, but kills the window...
> 			   -DaveM */
> 			while (p->processor != NO_PROC_ID)
> 				barrier();
> 			spin_unlock_wait(&scheduler_lock);
> #endif
> 			nr_tasks--;
> 			task[i] = NULL;
> 			REMOVE_LINKS(p);
> 			release_thread(p);
> 			if (STACK_MAGIC != *(unsigned long *)p->kernel_stack_page)
> 				printk(KERN_ALERT "release: %s kernel stack corruption. Aiee\n", p->comm);
> 			free_kernel_stack(p->kernel_stack_page);
> 			current->cmin_flt += p->min_flt + p->cmin_flt;
> 			current->cmaj_flt += p->maj_flt + p->cmaj_flt;
> 			current->cnswap += p->nswap + p->cnswap;
> 			free_task_struct(p);
> 			return;
> 		}
> 	panic("trying to release non-existent task");
> }
> 
> I'm not entirely clear on the point of this call.  It looks like it
> wanted to wait until p was guaranteed not to be running on any
> processor ever again.  (I don't see why it couldn't have just acquired
> the scheduler_lock -- was release() a particularly hot path?)
> 
> Although it doesn't matter now, this would mean that the original
> semantics of spin_unlock_wait were different from what we are
> discussing.  It apparently was meant to provide the release guarantee:
> any future critical sections would see the values that were visible
> before the call.  Ironic.

Cute!!!  ;-)

							Thanx, Paul