Date: Fri, 27 Sep 2013 22:41:16 +0200
From: Peter Zijlstra
To: Oleg Nesterov
Cc: "Paul E. McKenney", Mel Gorman, Rik van Riel, Srikar Dronamraju,
    Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
    Thomas Gleixner, Steven Rostedt
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
Message-ID: <20130927204116.GJ15690@laptop.programming.kicks-ass.net>
In-Reply-To: <20130927181532.GA8401@redhat.com>

On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > But if the reader does see BLOCK it will no longer be an active
> > reader; and thus the writer doesn't need to observe and wait for it.
>
> I meant they both can block, but please ignore. Today I simply can't
> understand what I was thinking about yesterday.

I think we all know that state all too well ;-)

> I tried hard to find any hole in this version but failed, I believe it
> is correct.

Yay!

> But, could you help me to understand some details?

I'll try, but I'm not too bright atm myself :-)

> > +void __get_online_cpus(void)
> > +{
> > +again:
> > +	/* See __srcu_read_lock() */
> > +	__this_cpu_inc(__cpuhp_refcount);
> > +	smp_mb(); /* A matches B, E */
> > +	__this_cpu_inc(cpuhp_seq);
> > +
> > +	if (unlikely(__cpuhp_state == readers_block)) {
>
> Note that there is no barrier() between inc(seq) and the __cpuhp_state
> check, so this inc() can be "postponed" till ...
>
> > +void __put_online_cpus(void)
> > {
> > +	/* See __srcu_read_unlock() */
> > +	smp_mb(); /* C matches D */
>
> ... this mb() in __put_online_cpus().
>
> And this is fine! The question is, perhaps it would be more "natural"
> and understandable to shift this_cpu_inc(cpuhp_seq) into
> __put_online_cpus().

Possibly; I never got further than that the required order is:

	ref++
	MB
	seq++
	MB
	ref--

It doesn't matter if the seq++ is in the lock or the unlock primitive.
I never considered one place more natural than the other.
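For concreteness, a minimal sketch of the variant Oleg suggests, with
the this_cpu_inc(cpuhp_seq) shifted into the unlock primitive. The
names come from the quoted patch, but the bodies are abbreviated and
the slow paths are elided, so this is illustrative only, not the
actual patch text:

	void __get_online_cpus(void)
	{
		__this_cpu_inc(__cpuhp_refcount);
		smp_mb();	/* A matches B, E */
		/* no seq++ on the lock side in this variant */

		if (unlikely(__cpuhp_state == readers_block)) {
			/* ... slow path elided: drop the ref and block ... */
		}
	}

	void __put_online_cpus(void)
	{
		__this_cpu_inc(cpuhp_seq);	/* moved from the lock side */
		smp_mb();	/* C matches D */
		__this_cpu_dec(__cpuhp_refcount);
	}

The seq++ still sits between the two full barriers, so the required
ref++, MB, seq++, MB, ref-- order is unchanged; only the primitive
hosting the increment differs.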
> We need to ensure 2 things:
>
> 1. The reader should notice state = BLOCK or the writer should see
>    inc(__cpuhp_refcount). This is guaranteed by the two mb()'s in
>    __get_online_cpus() and in cpu_hotplug_begin().
>
>    We do not care if the writer misses some inc(__cpuhp_refcount)
>    in per_cpu_sum(__cpuhp_refcount); those reader(s) should notice
>    state = readers_block (and inc(cpuhp_seq) can't help anyway).

Agreed.

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
>    from __put_online_cpus() (note that the writer can miss the
>    corresponding inc() if it was done on another CPU, so this dec()
>    can lead to sum() == 0), it should also notice the change in
>    cpuhp_seq.
>
>    Fortunately, this can only happen if the reader migrates; in
>    that case schedule() provides a barrier, so the writer can't miss
>    the change in cpuhp_seq.

Again, agreed; this is also the message of the second comment in
cpuhp_readers_active_check() by Paul.

> IOW. Unless I missed something, cpuhp_seq is actually needed to
> serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
> the /* D matches C */ in cpuhp_readers_active_check(), and this
> is not immediately clear if you look at __get_online_cpus().
>
> I do not suggest changing this code, but please tell me if my
> understanding is not correct.

I think you're entirely right.

> > +static bool cpuhp_readers_active_check(void)
> > {
> > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > +
> > +	smp_mb(); /* B matches A */
> > +
> > +	/*
> > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > +	 */
> >
> > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > +		return false;
> >
> > +	smp_mb(); /* D matches C */
>
> It seems that both barriers could be smp_rmb()? I am not sure the
> comments from srcu_readers_active_idx_check() can explain the mb();
> note that __srcu_read_lock() always succeeds, unlike get_online_cpus().

I see what you mean; cpuhp_readers_active_check() is all purely reads;
there are no writes to order.

Paul, is there any argument for the MB here as opposed to RMB; and if
not, should we change both these and SRCU?

> > void cpu_hotplug_done(void)
> > {
> > +	/* Signal the writer is done, no fast path yet. */
> > +	__cpuhp_state = readers_slow;
> > +	wake_up_all(&cpuhp_readers);
> > +
> > +	/*
> > +	 * The wait_event()/wake_up_all() prevents the race where the
> > +	 * readers are delayed between fetching __cpuhp_state and blocking.
> > +	 */
> > +
> > +	/* See percpu_up_write(); readers will no longer attempt to block. */
> > +	synchronize_sched();
> > +
> > +	/* Let 'em rip */
> > +	__cpuhp_state = readers_fast;
> > +	current->cpuhp_ref--;
> > +
> > +	/*
> > +	 * Wait for any pending readers to be running. This ensures readers
> > +	 * after writer and avoids writers starving readers.
> > +	 */
> > +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > }
>
> OK, to some degree I can understand the "avoids writers starving
> readers" part (although the next writer should do synchronize_sched()
> first), but could you explain "ensures readers after writer"?

Suppose reader A sees state == BLOCK and goes to sleep; our writer B
does cpu_hotplug_done() and wakes all pending readers.

If for some reason A doesn't get scheduled to inc its ref until B again
executes cpu_hotplug_begin() and state is once again BLOCK, A will not
have made any progress.

The waitcount increment before __put_online_cpus() ensures
cpu_hotplug_done() sees the !0 waitcount and waits until our reader
runs far enough to at least pass the dec_and_test(). And once past the
dec_and_test(), preemption is disabled and the synchronize_sched() in a
new cpu_hotplug_begin() will suffice to guarantee we'll have acquired a
reference and are an active reader.
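To make those last paragraphs concrete, here is a sketch of the
reader-side slow path as described above, reconstructed from this
discussion rather than copied from the patch; the preemption handling
around the wait is elided, and the exact wait/wake details may differ:

	void __get_online_cpus(void)
	{
	again:
		/* See __srcu_read_lock() */
		__this_cpu_inc(__cpuhp_refcount);
		smp_mb();	/* A matches B, E */
		__this_cpu_inc(cpuhp_seq);

		if (unlikely(__cpuhp_state == readers_block)) {
			/*
			 * Announce ourselves before dropping the reference,
			 * so cpu_hotplug_done() sees a !0 waitcount and
			 * waits for us; this is what keeps a back-to-back
			 * writer from starving us indefinitely.
			 */
			atomic_inc(&cpuhp_waitcount);
			__put_online_cpus();

			wait_event(cpuhp_readers, __cpuhp_state != readers_block);

			if (atomic_dec_and_test(&cpuhp_waitcount))
				wake_up(&cpuhp_writer);

			goto again;
		}
	}

Once a blocked reader passes the dec_and_test() here, the writer's
wait_event(cpuhp_writer, ...) can complete, and the synchronize_sched()
in the next cpu_hotplug_begin() guarantees the retry has either taken
its reference or will observe the new BLOCK state.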