Date: Thu, 29 Jun 2017 21:02:41 -0700
From: "Paul E. McKenney"
To: Boqun Feng
Cc: Alan Stern, Will Deacon, Linus Torvalds, Andrea Parri,
	Linux Kernel Mailing List, priyalee.kushwaha@intel.com,
	Stanisław Drozd, Arnd Bergmann, ldr709@gmail.com, Thomas Gleixner,
	Peter Zijlstra, Josh Triplett, Nicolas Pitre, Krister Johansen,
	Vegard Nossum, dcb314@hotmail.com, Wu Fengguang, Frederic Weisbecker,
	Rik van Riel, Steven Rostedt, Ingo Molnar, Luc Maranget, Jade Alglave
Subject: Re: [GIT PULL rcu/next] RCU commits for 4.13
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170629113848.GA18630@arm.com> <20170629181126.GA2393@linux.vnet.ibm.com> <20170630025126.jhflffwrnedlqrmz@tardis>
In-Reply-To: <20170630025126.jhflffwrnedlqrmz@tardis>
Message-Id: <20170630040241.GR2393@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jun 30, 2017 at 10:51:26AM +0800, Boqun Feng wrote:
> On Thu, Jun 29, 2017 at 11:11:26AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 29, 2017 at 11:59:27AM -0400, Alan Stern wrote:
> > > On Thu, 29 Jun 2017, Will Deacon wrote:
> > >
> > > > [turns out I've not been on cc for this thread, but Jade pointed me to it
> > > > and I see my name came up at some point!]
> > > >
> > > > On Wed, Jun 28, 2017 at 05:05:46PM -0700, Linus Torvalds wrote:
> > > > > On Wed, Jun 28, 2017 at 4:54 PM, Paul E. McKenney wrote:
> > > > > >
> > > > > > Linus, are you dead-set against defining spin_unlock_wait() to be
> > > > > > spin_lock + spin_unlock? For example, is the current x86 implementation
> > > > > > of spin_unlock_wait() really a non-negotiable hard requirement? Or
> > > > > > would you be willing to live with the spin_lock + spin_unlock semantics?
> > > > >
> > > > > So I think the "same as spin_lock + spin_unlock" semantics are kind of insane.
> > > > >
> > > > > One of the issues is that the same as "spin_lock + spin_unlock" is
> > > > > basically now architecture-dependent. Is it really the
> > > > > architecture-dependent ordering you want to define this as?
> > > > >
> > > > > So I just think it's a *bad* definition. If somebody wants something
> > > > > that is exactly equivalent to spin_lock+spin_unlock, then dammit, just
> > > > > do *THAT*. It's completely pointless to me to define
> > > > > spin_unlock_wait() in those terms.
> > > > >
> > > > > And if it's not equivalent to the *architecture* behavior of
> > > > > spin_lock+spin_unlock, then I think it should be described in terms
> > > > > that aren't about the architecture implementation (so you shouldn't
> > > > > describe it as "spin_lock+spin_unlock", you should describe it in
> > > > > terms of memory barrier semantics).
> > > > >
> > > > > And if we really have to use the spin_lock+spin_unlock semantics for
> > > > > this, then what is the advantage of spin_unlock_wait at all, if it
> > > > > doesn't fundamentally avoid some locking overhead of just taking the
> > > > > spinlock in the first place?
> > > >
> > > > Just on this point -- the arm64 code provides the same ordering semantics
> > > > as you would get from a lock;unlock sequence, but we can optimise that
> > > > when compared to an actual lock;unlock sequence because we don't need to
> > > > wait in turn for our ticket. I suspect something similar could be done
> > > > if/when we move to qspinlocks.
> > > >
> > > > Whether or not this is actually worth optimising is another question, but
> > > > it is worth noting that unlock_wait can be implemented more cheaply than
> > > > lock;unlock, whilst providing the same ordering guarantees (if that's
> > > > really what we want -- see my reply to Paul).
> > > >
> > > > Simplicity tends to be my preference, so ripping this out would suit me
> > > > best ;)
> > >
> > > It would be best to know:
> > >
> > > 	(1). How spin_unlock_wait() is currently being used.
> > >
> > > 	(2). What it was originally intended for.
> > >
> > > Paul has done some research into (1). He can correct me if I get this
> > > wrong... Only a few (i.e., around one or two) of the usages don't seem
> > > to require the full spin_lock+spin_unlock semantics. I go along with
> > > Linus; the places which really do want it to behave like
> > > spin_lock+spin_unlock should simply use spin_lock+spin_unlock.
> > > There hasn't been any indication so far that the possible efficiency
> > > improvement Will mentions is at all important.
> > >
> > > According to Paul, most of the other places don't need anything more
> > > than the acquire guarantee (any changes made in earlier critical
> > > sections will be visible to the code following spin_unlock_wait). In
> > > which case, the semantics of spin_unlock_wait could be redefined in
> > > this simpler form.
> > >
> > > Or we could literally replace all the existing definitions with
> > > spin_lock+spin_unlock. Would that be so terrible?
> >
> > And here they are...
> >
> > spin_unlock_wait():
> >
> > o	drivers/ata/libata-eh.c ata_scsi_cmd_error_handler()
> > 	spin_unlock_wait(ap->lock) in else-clause where then-clause has
> > 	a full critical section for this same lock. This use case could
> > 	potentially require both acquire and release semantics. (I am
> > 	following up with the developers/maintainers, suggesting that
> > 	they convert to spin_lock+spin_unlock if they need release
> > 	semantics.)
> >
> > 	This is error-handling code, which should be rare, so
> > 	spin_lock+spin_unlock should work fine here. Probably shouldn't
> > 	have bugged the maintainer, but email already sent. :-/
> >
> > o	ipc/sem.c exit_sem()
> > 	This use case appears to need to wait only on prior critical
> > 	sections, as the only way we get here is if the entry has already
> > 	been removed from the list. An acquire-only spin_unlock_wait()
> > 	works here. However, this is sem-exit code, which is not a
> > 	fastpath, and the race should be rare, so spin_lock+spin_unlock
> > 	should work fine here.
> >
> > o	kernel/sched/completion.c completion_done()
> > 	This use case appears to need to wait only on prior critical
> > 	sections, as the only way we get past the "if" is when the lock is
> > 	held by complete(), and you are only supposed to invoke complete()
> > 	once on a given completion.
> > 	An acquire-only spin_unlock_wait() works here, but the race should
> > 	be rare, so spin_lock+spin_unlock should also work fine here.
> >
> > o	net/netfilter/nf_conntrack_core.c nf_conntrack_lock()
> > 	This instance of spin_unlock_wait() interacts with
> > 	nf_conntrack_all_lock()'s instance of spin_unlock_wait().
> > 	Although nf_conntrack_all_lock() has an smp_mb(), which I
> > 	believe provides release semantics given current implementations,
> > 	nf_conntrack_lock() just has smp_rmb().
> >
> > 	I believe that the smp_rmb() needs to be smp_mb(). Am I missing
> > 	something here that makes the current code safe on x86?
> >
>
> Actually, I think the smp_rmb(), even together with the spin_unlock_wait(),
> is not needed in nf_conntrack_lock(); we could implement
> nf_conntrack_lock() as:
>
>
> 	void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> 	{
> 		spin_lock(lock);
> 		while (unlikely(smp_load_acquire(&nf_conntrack_locks_all))) {
> 			spin_unlock(lock);
> 			cpu_relax();
> 			spin_lock(lock);
> 		}
> 	}
>
> because in nf_conntrack_all_unlock(), we have:
>
> 	smp_store_release(&nf_conntrack_locks_all, false);
> 	spin_unlock(&nf_conntrack_locks_all_lock);
>
> so if we exit the loop, which means we observe nf_conntrack_locks_all
> being false, we actually hold the per-bucket lock and observe everything
> before the smp_store_release(), which is the same as everything in the
> critical section of nf_conntrack_locks_all_lock. Otherwise, we observe
> nf_conntrack_locks_all being true, which means a global lock
> critical section may be on its way; we simply drop the per-bucket lock
> and test again some time later whether the global lock is finished.
>
> So I think spin_unlock_wait() in the nf_conntrack_lock() just requires
> acquire semantics, at least.
>
> Maybe I miss something?

Or perhaps I was being too paranoid. But does the same analysis work
in the case where an nf_conntrack_lock() races with an
nf_conntrack_all_lock()?
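
For concreteness, here is a minimal userspace sketch of that race with the
global side momentarily acquiring each per-bucket lock instead of calling
spin_unlock_wait(). It uses C11 atomics rather than kernel primitives, and
every demo_* name, the test-and-set lock, and the bucket count are
assumptions made for illustration. This is not the netfilter code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NLOCKS 4	/* hypothetical bucket count for this sketch */

/* Hypothetical test-and-set spinlock standing in for spinlock_t. */
typedef struct { atomic_bool held; } demo_spinlock;

static void demo_spin_lock(demo_spinlock *l)
{
	/* Acquire ordering: see everything the previous holder wrote. */
	while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
		;
}

static void demo_spin_unlock(demo_spinlock *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

static demo_spinlock demo_buckets[DEMO_NLOCKS];
static demo_spinlock demo_global;
static atomic_bool demo_locks_all;

/* Per-bucket side, following the shape of the proposal quoted above. */
static void demo_bucket_lock(demo_spinlock *lock)
{
	demo_spin_lock(lock);
	while (atomic_load_explicit(&demo_locks_all, memory_order_acquire)) {
		demo_spin_unlock(lock);
		/* A real implementation would relax the CPU here. */
		demo_spin_lock(lock);
	}
}

/* Global side: lock+unlock each bucket in place of spin_unlock_wait(). */
static void demo_all_lock(void)
{
	demo_spin_lock(&demo_global);
	atomic_store(&demo_locks_all, true);	/* conservative seq_cst store */
	for (int i = 0; i < DEMO_NLOCKS; i++) {
		demo_spin_lock(&demo_buckets[i]);	/* waits out a current holder */
		demo_spin_unlock(&demo_buckets[i]);
	}
}

static void demo_all_unlock(void)
{
	atomic_store_explicit(&demo_locks_all, false, memory_order_release);
	demo_spin_unlock(&demo_global);
}

/* Single-threaded smoke test of the uncontended paths; returns 0. */
int demo_conntrack_check(void)
{
	demo_bucket_lock(&demo_buckets[0]);
	assert(atomic_load(&demo_buckets[0].held));
	demo_spin_unlock(&demo_buckets[0]);

	demo_all_lock();
	assert(atomic_load(&demo_locks_all));
	assert(!atomic_load(&demo_buckets[0].held));	/* only held momentarily */
	demo_all_unlock();
	assert(!atomic_load(&demo_locks_all));
	return 0;
}
```

In this analogue, a per-bucket locker that gets its lock after the global
side's lock+unlock of that bucket is ordered after the flag store by the
lock handoff, so it sees the flag and backs off. One that got in earlier
is simply waited for. Calling demo_conntrack_check() from a main()
exercises only the uncontended paths; it does not demonstrate the race.
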
> > 	I believe that this code could use spin_lock+spin_unlock without
> > 	significant performance penalties -- I do not believe that
> > 	nf_conntrack_locks_all_lock gets significant contention.
> >
> > raw_spin_unlock_wait() (Courtesy of Andrea Parri with added commentary):
> >
> > o	kernel/exit.c do_exit()
> > 	Seems to rely on both acquire and release semantics. The
> > 	raw_spin_unlock_wait() primitive is preceded by a smp_mb().
> > 	But this is task exit doing spin_unlock_wait() on the task's
> > 	lock, so spin_lock+spin_unlock should work fine here.
> >
> > o	kernel/sched/core.c do_task_dead()
> > 	Seems to rely on the acquire semantics only. The
> > 	raw_spin_unlock_wait() primitive is preceded by an inexplicable
> > 	smp_mb(). Again, this is task exit doing spin_unlock_wait() on
> > 	the task's lock, so spin_lock+spin_unlock should work fine here.
> >
> > o	kernel/task_work.c task_work_run()
> > 	Seems to rely on the acquire semantics only. This is to handle
>
> I think this one needs the stronger semantics; the smp_mb() is just
> hidden in the cmpxchg() before the raw_spin_unlock_wait() ;-)
>
> cmpxchg() sets a special value to indicate the task_work has been taken,
> and raw_spin_unlock_wait() must wait until the next critical section of
> ->pi_lock (in task_work_cancel()) could observe this, otherwise we may
> cancel a task_work while executing it.

But either way, replacing the spin_unlock_wait() with a spin_lock()
immediately followed by a spin_unlock() should work correctly, right?

							Thanx, Paul

> Regards,
> Boqun
>
> > 	a race with task_work_cancel(), which appears to be quite rare.
> > 	So the spin_lock+spin_unlock should work fine here.
> >
> > spin_lock()/spin_unlock():
> >
> > o	ipc/sem.c complexmode_enter()
> > 	This used to be spin_unlock_wait(), but was changed to a
> > 	spin_lock()/spin_unlock() pair by 27d7be1801a4 ("ipc/sem.c:
> > 	avoid using spin_unlock_wait()").
> >
> > Looks to me like we really can drop spin_unlock_wait() in favor of
> > momentarily acquiring the lock. There are so few use cases that I don't
> > see a problem open-coding this. I will put together yet another patch
> > series for my spin_unlock_wait() collection of patch serieses. ;-)
> >
> > > As regards (2), I did a little digging. spin_unlock_wait was
> > > introduced in the 2.1.36 kernel, in mid-April 1997. I wasn't able to
> > > find a specific patch for it in the LKML archives. At the time it
> > > was used in only one place in the entire kernel (in kernel/exit.c):
> > >
> > > void release(struct task_struct * p)
> > > {
> > > 	int i;
> > >
> > > 	if (!p)
> > > 		return;
> > > 	if (p == current) {
> > > 		printk("task releasing itself\n");
> > > 		return;
> > > 	}
> > > 	for (i=1 ; i<NR_TASKS ; i++)
> > > 		if (task[i] == p) {
> > > #ifdef __SMP__
> > > 			/* FIXME! Cheesy, but kills the window... -DaveM */
> > > 			while(p->processor != NO_PROC_ID)
> > > 				barrier();
> > > 			spin_unlock_wait(&scheduler_lock);
> > > #endif
> > > 			nr_tasks--;
> > > 			task[i] = NULL;
> > > 			REMOVE_LINKS(p);
> > > 			release_thread(p);
> > > 			if (STACK_MAGIC != *(unsigned long *)p->kernel_stack_page)
> > > 				printk(KERN_ALERT "release: %s kernel stack corruption. Aiee\n", p->comm);
> > > 			free_kernel_stack(p->kernel_stack_page);
> > > 			current->cmin_flt += p->min_flt + p->cmin_flt;
> > > 			current->cmaj_flt += p->maj_flt + p->cmaj_flt;
> > > 			current->cnswap += p->nswap + p->cnswap;
> > > 			free_task_struct(p);
> > > 			return;
> > > 		}
> > > 	panic("trying to release non-existent task");
> > > }
> > >
> > > I'm not entirely clear on the point of this call. It looks like it
> > > wanted to wait until p was guaranteed not to be running on any
> > > processor ever again. (I don't see why it couldn't have just acquired
> > > the scheduler_lock -- was release() a particularly hot path?)
> > >
> > > Although it doesn't matter now, this would mean that the original
> > > semantics of spin_unlock_wait were different from what we are
> > > discussing. It apparently was meant to provide the release guarantee:
> > > any future critical sections would see the values that were visible
> > > before the call. Ironic.
> >
> > Cute!!! ;-)
> >
> > 							Thanx, Paul
> >
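
Will's point earlier in the thread, that an unlock_wait need not wait in
turn for a ticket, can be illustrated with a toy ticket lock. What follows
is a minimal sketch using C11 atomics. The demo_* names are hypothetical,
the wait shown provides only the acquire guarantee discussed above, and it
is not the arm64 or x86 implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock for illustration only. */
typedef struct {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
} demo_ticket_lock;

static void demo_ticket_lock_acquire(demo_ticket_lock *l)
{
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);

	/* Wait our turn; the acquire pairs with the release in unlock. */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static void demo_ticket_unlock(demo_ticket_lock *l)
{
	/* Only the holder writes owner, so load-then-store is enough. */
	unsigned int cur = atomic_load_explicit(&l->owner,
						memory_order_relaxed);

	atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}

/* Acquire-only wait: returns once the holder seen on entry has released. */
static void demo_ticket_unlock_wait(demo_ticket_lock *l)
{
	unsigned int seen = atomic_load_explicit(&l->owner,
						 memory_order_acquire);

	if (seen == atomic_load_explicit(&l->next, memory_order_relaxed))
		return;		/* nobody holds the lock */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) == seen)
		;		/* no ticket taken, no store to the lock */
}

/* Single-threaded check of the uncontended paths; returns 0. */
int demo_ticket_check(void)
{
	demo_ticket_lock l;

	atomic_init(&l.next, 0);
	atomic_init(&l.owner, 0);

	demo_ticket_unlock_wait(&l);		/* free lock: returns at once */
	assert(atomic_load(&l.next) == 0);	/* no ticket was consumed */

	demo_ticket_lock_acquire(&l);		/* lock+unlock does consume one */
	demo_ticket_unlock(&l);
	assert(atomic_load(&l.next) == 1);
	assert(atomic_load(&l.owner) == 1);
	return 0;
}
```

The wait never stores to the lock and never queues behind later arrivals,
which is where any saving over lock+unlock would come from. Whether that
saving matters for the call sites listed above is a separate question.
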