Date: Fri, 30 Jun 2017 10:51:26 +0800
From: Boqun Feng <boqun.feng@gmail.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>,
        Will Deacon <will.deacon@arm.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andrea Parri <parri.andrea@gmail.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        priyalee.kushwaha@intel.com,
        Stanisław Drozd <drozdziak1@gmail.com>,
        Arnd Bergmann <arnd@arndb.de>, ldr709@gmail.com,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>,
        Josh Triplett <josh@joshtriplett.org>, Nicolas Pitre <nico@linaro.org>,
        Krister Johansen <kjlx@templeofstupid.com>,
        Vegard Nossum <vegard.nossum@oracle.com>, dcb314@hotmail.com,
        Wu Fengguang <fengguang.wu@intel.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Rik van Riel <riel@redhat.com>, Steven Rostedt <rostedt@goodmis.org>,
        Ingo Molnar <mingo@kernel.org>, Luc Maranget <luc.maranget@inria.fr>,
        Jade Alglave <j.alglave@ucl.ac.uk>
Subject: Re: [GIT PULL rcu/next] RCU commits for 4.13
Message-ID: <20170630025126.jhflffwrnedlqrmz@tardis>
References: <20170629113848.GA18630@arm.com>
 <Pine.LNX.4.44L0.1706291132310.1571-100000@iolanthe.rowland.org>
 <20170629181126.GA2393@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
        protocol="application/pgp-signature"; boundary="y2c7gvjwhkbuqrls"
Content-Disposition: inline
In-Reply-To: <20170629181126.GA2393@linux.vnet.ibm.com>
User-Agent: NeoMutt/20170225 (1.8.0)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 11686
Lines: 287


--y2c7gvjwhkbuqrls
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Jun 29, 2017 at 11:11:26AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 29, 2017 at 11:59:27AM -0400, Alan Stern wrote:
> > On Thu, 29 Jun 2017, Will Deacon wrote:
> >
> > > [turns out I've not been on cc for this thread, but Jade pointed me to it
> > >  and I see my name came up at some point!]
> > >
> > > On Wed, Jun 28, 2017 at 05:05:46PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jun 28, 2017 at 4:54 PM, Paul E. McKenney
> > > > <paulmck@linux.vnet.ibm.com> wrote:
> > > > >
> > > > > Linus, are you dead-set against defining spin_unlock_wait() to be
> > > > > spin_lock + spin_unlock?  For example, is the current x86 implementation
> > > > > of spin_unlock_wait() really a non-negotiable hard requirement?  Or
> > > > > would you be willing to live with the spin_lock + spin_unlock semantics?
> > > >
> > > > So I think the "same as spin_lock + spin_unlock" semantics are kind of insane.
> > > >
> > > > One of the issues is that the same as "spin_lock + spin_unlock" is
> > > > basically now architecture-dependent. Is it really the
> > > > architecture-dependent ordering you want to define this as?
> > > >
> > > > So I just think it's a *bad* definition. If somebody wants something
> > > > that is exactly equivalent to spin_lock+spin_unlock, then dammit, just
> > > > do *THAT*. It's completely pointless to me to define
> > > > spin_unlock_wait() in those terms.
> > > >
> > > > And if it's not equivalent to the *architecture* behavior of
> > > > spin_lock+spin_unlock, then I think it should be described in terms
> > > > that aren't about the architecture implementation (so you shouldn't
> > > > describe it as "spin_lock+spin_unlock", you should describe it in
> > > > terms of memory barrier semantics).
> > > >
> > > > And if we really have to use the spin_lock+spin_unlock semantics for
> > > > this, then what is the advantage of spin_unlock_wait at all, if it
> > > > doesn't fundamentally avoid some locking overhead of just taking the
> > > > spinlock in the first place?
> > >
> > > Just on this point -- the arm64 code provides the same ordering semantics
> > > as you would get from a lock;unlock sequence, but we can optimise that
> > > when compared to an actual lock;unlock sequence because we don't need to
> > > wait in turn for our ticket. I suspect something similar could be done
> > > if/when we move to qspinlocks.
> > >
> > > Whether or not this is actually worth optimising is another question, but
> > > it is worth noting that unlock_wait can be implemented more cheaply than
> > > lock;unlock, whilst providing the same ordering guarantees (if that's
> > > really what we want -- see my reply to Paul).
> > >
> > > Simplicity tends to be my preference, so ripping this out would suit me
> > > best ;)
> >
> > It would be best to know:
> >
> > 	(1). How spin_unlock_wait() is currently being used.
> >
> > 	(2). What it was originally intended for.
> >
> > Paul has done some research into (1).  He can correct me if I get this
> > wrong...  Only a few (i.e., around one or two) of the usages don't seem
> > to require the full spin_lock+spin_unlock semantics.  I go along with
> > Linus; the places which really do want it to behave like
> > spin_lock+spin_unlock should simply use spin_lock+spin_unlock.  There
> > hasn't been any indication so far that the possible efficiency
> > improvement Will mentions is at all important.
> >
> > According to Paul, most of the other places don't need anything more
> > than the acquire guarantee (any changes made in earlier critical
> > sections will be visible to the code following spin_unlock_wait).  In
> > which case, the semantics of spin_unlock_wait could be redefined in
> > this simpler form.
> >
> > Or we could literally replace all the existing definitions with
> > spin_lock+spin_unlock.  Would that be so terrible?
>
> And here they are...
>
> spin_unlock_wait():
>
> o	drivers/ata/libata-eh.c ata_scsi_cmd_error_handler()
> 	spin_unlock_wait(ap->lock) in else-clause where then-clause has
> 	a full critical section for this same lock.  This use case could
> 	potentially require both acquire and release semantics.  (I am
> 	following up with the developers/maintainers, suggesting that
> 	they convert to spin_lock+spin_unlock if they need release
> 	semantics.)
>
> 	This is error-handling code, which should be rare, so
> 	spin_lock+spin_unlock should work fine here.  Probably shouldn't
> 	have bugged the maintainer, but email already sent.  :-/
>
> o	ipc/sem.c exit_sem()
> 	This use case appears to need to wait only on prior critical
> 	sections, as the only way we get here is if the entry has already
> 	been removed from the list.  An acquire-only spin_unlock_wait()
> 	works here.  However, this is sem-exit code, which is not a
> 	fastpath, and the race should be rare, so spin_lock+spin_unlock
> 	should work fine here.
>
> o	kernel/sched/completion.c completion_done()
> 	This use case appears to need to wait only on prior critical
> 	sections, as the only way we get past the "if" is when the lock is
> 	held by complete(), and you are only supposed to invoke complete()
> 	once on a given completion.  An acquire-only spin_unlock_wait()
> 	works here, but the race should be rare, so spin_lock+spin_unlock
> 	should also work fine here.
>
> o	net/netfilter/nf_conntrack_core.c nf_conntrack_lock()
> 	This instance of spin_unlock_wait() interacts with
> 	nf_conntrack_all_lock()'s instance of spin_unlock_wait().
> 	Although nf_conntrack_all_lock() has an smp_mb(), which I
> 	believe provides release semantics given current implementations,
> 	nf_conntrack_lock() just has smp_rmb().
>
> 	I believe that the smp_rmb() needs to be smp_mb().  Am I missing
> 	something here that makes the current code safe on x86?
>

Actually, I think neither the smp_rmb() nor the spin_unlock_wait() in
nf_conntrack_lock() is needed; we could implement nf_conntrack_lock()
as:

	void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
	{
		spin_lock(lock);
		while (unlikely(smp_load_acquire(&nf_conntrack_locks_all))) {
			spin_unlock(lock);
			cpu_relax();
			spin_lock(lock);
		}
	}

because in nf_conntrack_all_unlock(), we have:

		smp_store_release(&nf_conntrack_locks_all, false);
		spin_unlock(&nf_conntrack_locks_all_lock);

So if we exit the loop, we have observed nf_conntrack_locks_all being
false while holding the per-bucket lock, and we therefore observe
everything before the smp_store_release(), that is, everything in the
critical section of nf_conntrack_locks_all_lock.  Otherwise, we observe
nf_conntrack_locks_all being true, which means a global lock critical
section may be in progress, so we simply drop the per-bucket lock and
check again a little later whether the global lock has been released.
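
To make the argument above concrete, here is a rough userspace sketch of
the same protocol.  It is only an analogy under stated assumptions, not
the kernel code: pthread mutexes stand in for the spinlocks, C11 atomics
stand in for smp_load_acquire()/smp_store_release(), sched_yield() stands
in for cpu_relax(), there is a single bucket, and the global side passes
through the bucket lock with lock+unlock rather than spin_unlock_wait().
All of the demo_*/run_demo names are made up for illustration.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_WORKERS 4
#define NR_ITERS   5000

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t all_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool locks_all;	/* stand-in for nf_conntrack_locks_all */
static long counter;		/* plain data guarded by the protocol */

/* Per-bucket side: take the lock, then back off while the flag is set. */
static void demo_bucket_lock(void)
{
	pthread_mutex_lock(&bucket_lock);
	while (atomic_load_explicit(&locks_all, memory_order_acquire)) {
		pthread_mutex_unlock(&bucket_lock);
		sched_yield();	/* stand-in for cpu_relax() */
		pthread_mutex_lock(&bucket_lock);
	}
}

/*
 * Global side: set the flag, then pass through the bucket lock once so
 * that any holder that missed the flag has left its critical section.
 */
static void demo_all_lock(void)
{
	pthread_mutex_lock(&all_lock);
	atomic_store(&locks_all, true);
	pthread_mutex_lock(&bucket_lock);
	pthread_mutex_unlock(&bucket_lock);
}

/* Release-store the flag so bucket lockers see the global critical section. */
static void demo_all_unlock(void)
{
	atomic_store_explicit(&locks_all, false, memory_order_release);
	pthread_mutex_unlock(&all_lock);
}

static void *bucket_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < NR_ITERS; i++) {
		demo_bucket_lock();
		counter++;
		pthread_mutex_unlock(&bucket_lock);
	}
	return NULL;
}

static void *global_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < NR_ITERS; i++) {
		demo_all_lock();
		counter++;
		demo_all_unlock();
	}
	return NULL;
}

/* Returns the final counter; no update is lost if exclusion holds. */
static long run_demo(void)
{
	pthread_t workers[NR_WORKERS], global;

	counter = 0;
	for (int i = 0; i < NR_WORKERS; i++)
		pthread_create(&workers[i], NULL, bucket_worker, NULL);
	pthread_create(&global, NULL, global_worker, NULL);
	for (int i = 0; i < NR_WORKERS; i++)
		pthread_join(workers[i], NULL);
	pthread_join(global, NULL);
	return counter;
}
```

Calling run_demo() from a main() built with -pthread should return
(NR_WORKERS + 1) * NR_ITERS if the bucket side and the global side never
overlap, which is the property the acquire load is meant to give us.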

So I think that, at least in nf_conntrack_lock(), spin_unlock_wait()
requires only acquire semantics.

Maybe I'm missing something?

> 	I believe that this code could use spin_lock+spin_unlock without
> 	significant performance penalties -- I do not believe that
> 	nf_conntrack_locks_all_lock gets significant contention.
>
> raw_spin_unlock_wait() (Courtesy of Andrea Parri with added commentary):
>
> o	kernel/exit.c do_exit()
> 	Seems to rely on both acquire and release semantics. The
> 	raw_spin_unlock_wait() primitive is preceded by a smp_mb().
> 	But this is task exit doing spin_unlock_wait() on the task's
> 	lock, so spin_lock+spin_unlock should work fine here.
>
> o	kernel/sched/core.c do_task_dead()
> 	Seems to rely on the acquire semantics only. The
> 	raw_spin_unlock_wait() primitive is preceded by an inexplicable
> 	smp_mb().  Again, this is task exit doing spin_unlock_wait() on
> 	the task's lock, so spin_lock+spin_unlock should work fine here.
>
> o	kernel/task_work.c task_work_run()
> 	Seems to rely on the acquire semantics only.  This is to handle

I think this one needs the stronger semantics; the smp_mb() is just
hidden in the cmpxchg() before the raw_spin_unlock_wait() ;-)

cmpxchg() sets a special value to indicate that the task_work has been
taken, and raw_spin_unlock_wait() must wait until the next critical
section of ->pi_lock (in task_work_cancel()) can observe this; otherwise
we may cancel a task_work while executing it.
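
Roughly, the pattern I have in mind looks like the userspace sketch
below.  This is an illustration under assumptions, not the code in
kernel/task_work.c: a pthread mutex stands in for ->pi_lock, C11
sequentially consistent atomics stand in for cmpxchg(), the runner simply
swaps in NULL instead of a special "exited" value, and a lock+unlock pair
stands in for raw_spin_unlock_wait().  The names (struct work, work_add,
work_cancel, work_run, count_work, other_work) are all made up.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct work {
	_Atomic(struct work *) next;
	void (*func)(struct work *);
};

static _Atomic(struct work *) works;	/* lock-free list head */
static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;
static int ran;				/* touched only by the runner */

static void count_work(struct work *w) { (void)w; ran++; }
static void other_work(struct work *w) { (void)w; ran += 100; }

/* Push a work item onto the head of the list. */
static void work_add(struct work *w)
{
	struct work *head = atomic_load(&works);

	do {
		atomic_store(&w->next, head);
	} while (!atomic_compare_exchange_weak(&works, &head, w));
}

/* Canceller: under pi_lock, unlink the first entry whose func matches. */
static struct work *work_cancel(void (*func)(struct work *))
{
	_Atomic(struct work *) *pprev = &works;
	struct work *w;

	pthread_mutex_lock(&pi_lock);
	while ((w = atomic_load(pprev)) != NULL) {
		if (w->func != func) {
			pprev = &w->next;
		} else {
			struct work *expected = w;

			if (atomic_compare_exchange_strong(pprev, &expected,
							   atomic_load(&w->next)))
				break;
		}
	}
	pthread_mutex_unlock(&pi_lock);
	return w;
}

/* Runner: detach the whole list, wait out any canceller, then run it. */
static void work_run(void)
{
	/* Fully ordered RMW: later cancellers see an empty list. */
	struct work *w = atomic_exchange(&works, NULL);

	if (!w)
		return;
	/* A canceller already inside pi_lock may still be unlinking
	 * entries from the detached list; wait for it to finish. */
	pthread_mutex_lock(&pi_lock);
	pthread_mutex_unlock(&pi_lock);
	while (w) {
		struct work *next = atomic_load(&w->next);

		w->func(w);
		w = next;
	}
}
```

The point of the sketch is the ordering in work_run(): the detach must be
visible to any canceller whose critical section starts after the wait,
which is why acquire-only semantics would not be enough here.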

Regards,
Boqun
> 	a race with task_work_cancel(), which appears to be quite rare.
> 	So the spin_lock+spin_unlock should work fine here.
>
> spin_lock()/spin_unlock():
>
> o	ipc/sem.c complexmode_enter()
> 	This used to be spin_unlock_wait(), but was changed to a
> 	spin_lock()/spin_unlock() pair by 27d7be1801a4 ("ipc/sem.c:
> 	avoid using spin_unlock_wait()").
>
> Looks to me like we really can drop spin_unlock_wait() in favor of
> momentarily acquiring the lock.  There are so few use cases that I don't
> see a problem open-coding this.  I will put together yet another patch
> series for my spin_unlock_wait() collection of patch serieses.  ;-)
>
> > As regards (2), I did a little digging.  spin_unlock_wait was
> > introduced in the 2.1.36 kernel, in mid-April 1997.  I wasn't able to
> > find a specific patch for it in the LKML archives.  At the time it
> > was used in only one place in the entire kernel (in kernel/exit.c):
> >
> > void release(struct task_struct * p)
> > {
> > 	int i;
> >
> > 	if (!p)
> > 		return;
> > 	if (p == current) {
> > 		printk("task releasing itself\n");
> > 		return;
> > 	}
> > 	for (i=1 ; i<NR_TASKS ; i++)
> > 		if (task[i] == p) {
> > #ifdef __SMP__
> > 			/* FIXME! Cheesy, but kills the window... -DaveM */
> > 			while(p->processor != NO_PROC_ID)
> > 				barrier();
> > 			spin_unlock_wait(&scheduler_lock);
> > #endif
> > 			nr_tasks--;
> > 			task[i] = NULL;
> > 			REMOVE_LINKS(p);
> > 			release_thread(p);
> > 			if (STACK_MAGIC != *(unsigned long *)p->kernel_stack_page)
> > 				printk(KERN_ALERT "release: %s kernel stack corruption. Aiee\n", p->comm);
> > 			free_kernel_stack(p->kernel_stack_page);
> > 			current->cmin_flt += p->min_flt + p->cmin_flt;
> > 			current->cmaj_flt += p->maj_flt + p->cmaj_flt;
> > 			current->cnswap += p->nswap + p->cnswap;
> > 			free_task_struct(p);
> > 			return;
> > 		}
> > 	panic("trying to release non-existent task");
> > }
> >
> > I'm not entirely clear on the point of this call.  It looks like it
> > wanted to wait until p was guaranteed not to be running on any
> > processor ever again.  (I don't see why it couldn't have just acquired
> > the scheduler_lock -- was release() a particularly hot path?)
> >
> > Although it doesn't matter now, this would mean that the original
> > semantics of spin_unlock_wait were different from what we are
> > discussing.  It apparently was meant to provide the release guarantee:
> > any future critical sections would see the values that were visible
> > before the call.  Ironic.
>
> Cute!!!  ;-)
>
> 							Thanx, Paul
>

--y2c7gvjwhkbuqrls
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQEzBAABCAAdFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAllVvKoACgkQSXnow7UH
+rjgiAgAn6iGPgBIFkA+sCTsfhwHihKWnCX+oUauBfkhjRlM8ywmOQfsMKsLppC5
ryUCgh3zDECKvZZPb9WqE3LlK31E77bYe+NixSoLtXe1lxeh6GHz5BefUlmSMOjZ
9akWGLtGbibKy5dEd1sSbdiLGcR5BjfKdA9c34dZB9Drt8uYyx+bSVyq1fhq2/sL
uSXC6C4+g0d+GHnRO25gCzBbruSCxCPbD5kmDCZTycHyMEaGPBaD9wakiJ4//zrj
mwMJmi7vTDqdMon5ZJjHGNHg+ji6ZLn01gXZWam6Zwq/FcMRERyLAb99Pira6gJJ
mYYJswNq/XDtWoexXgForqL1zl4UGQ==
=BpwI
-----END PGP SIGNATURE-----

--y2c7gvjwhkbuqrls--