Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752511Ab1FCGtl (ORCPT ); Fri, 3 Jun 2011 02:49:41 -0400 Received: from mail-yw0-f46.google.com ([209.85.213.46]:46041 "EHLO mail-yw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751829Ab1FCGtj convert rfc822-to-8bit (ORCPT ); Fri, 3 Jun 2011 02:49:39 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=FwdNEqvhMBAKk9p336CK9Z+BgjtIhe1t0SBcB03xb4vVvVmmkjjPWmwR9OqtvSRwqK KgVSSW01lWn8H2IQu6qn5kCd0Is8uc0rd+HrbuxqfFe53LhiprWdONs2Q6Mhlxat/cNv BbmwozU0Tn+pehSZYGIdIVTPfJZGgfX6jLnuc= MIME-Version: 1.0 In-Reply-To: <1307029711.2497.717.camel@laptop> References: <1306835745.2353.3.camel@twins> <20110531125621.GA24439@gere.osrc.amd.com> <1306847516.2353.80.camel@twins> <20110601070547.GB3368@liondog.tnic> <1306924612.2353.176.camel@twins> <20110601155017.GD24028@aftab> <1307019866.2497.675.camel@laptop> <20110602142340.GA3356@zhy> <1307029711.2497.717.camel@laptop> Date: Fri, 3 Jun 2011 14:49:38 +0800 Message-ID: Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups From: Yong Zhang To: Peter Zijlstra Cc: Borislav Petkov , Borislav Petkov , "mingo@redhat.com" , "hpa@zytor.com" , "linux-kernel@vger.kernel.org" , "markus@trippelsdorf.de" , "tglx@linutronix.de" , "mingo@elte.hu" , "linux-tip-commits@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3890 Lines: 116 On Thu, Jun 2, 2011 at 11:48 PM, Peter Zijlstra wrote: > On Thu, 2011-06-02 at 22:23 +0800, Yong Zhang wrote: >> On Thu, Jun 02, 2011 at 03:04:26PM +0200, Peter Zijlstra wrote: >> > On Thu, 2011-06-02 at 15:52 +0800, Yong Zhang wrote: >> > > In sched_clock_local(), clock is calculated around ->tick_gtod even if >> > > that ->tick_gtod is stale for long time because we stays in idle state. >> > > You know ->tick_gtod is only updated in sched_clock_tick(); >> > >> > (well, no, there's idle callbacks as you said below) >> > >> > > IOW, when a cpu goes out of idle, sched_clock_tick() is called from >> > > tick_nohz_stop_idle() which is later than interrupt. >> > >> > Gah, that would be awefull and mean wakeups from interrupts were already >> > borken. /me goes look at code. >> > >> > irq_enter() -> tick_check_idle() -> tick_check_nohz() -> >> > tick_nohz_stop_idle() -> sched_clock_idle_wakeup_event() >> > >> > should update the thing before we run any isrs, right? >> >> Hmmm, you are right. >> >> But smp_reschedule_interrupt() doesn't call irq_enter()/irq_exit(), >> is that correct? > > Crap.. you're right. > And I bet other archs don't do that either. Most of them ;) I only notice sparc32 do that. Maybe there have more, but I didn't check it very carefully. > With > NO_HZ you really need irq_enter() for pretty much all interrupts so I > was assuming the resched IPI had it, but its been special and never > really needed it. If it would wake an idle cpu the idle loop exit would > deal with it, if it interrupted userspace the thing was running and > NO_HZ wasn't relevant. > > Damn. > > And yes, the only reason I didn't see this on my dev box was because we > do indeed set that sched_clock_stable thing on wsm. And I never noticed > on my desktop because firefox/X/etc. consuming heaps of CPU isn't weird > at all. > > Adding it to all resched int handlers is of course a possibility but > would slow down the thing, although with the new code, most users are > now indeed wakeups (excepting weird and wonderful users like KVM). > > We could of course add it in sched.c since the logic recurses just > fine.. its not pretty though.. :/ Yeah, IMHO it's suitable here and my test looks good. Reviewed-and-Tested-by: Yong Zhang BTW, sched_ipi() and sched_ttwu_pending() could share a piece of code now. And we place irq_enter()/irq_exit() in sched_ipi() because it's the only function we could call, thus account_system_vtime() could get the almost exact time value. IOW we should pay some attention on the future change of smp_reschedule_interrupt(). Thanks, Yong > > Thoughts? > > --- >  kernel/sched.c |   18 +++++++++++++++++- >  1 files changed, 17 insertions(+), 1 deletions(-) > > diff --git a/kernel/sched.c b/kernel/sched.c > index 2fe98ed..365ed6b 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -2554,7 +2554,23 @@ static void sched_ttwu_pending(void) > >  void scheduler_ipi(void) >  { > -       sched_ttwu_pending(); > +       struct rq *rq = this_rq(); > +       struct task_struct *list = xchg(&rq->wake_list, NULL); > + > +       if (!list) > +               return; > + > +       irq_enter(); > +       raw_spin_lock(&rq->lock); > + > +       while (list) { > +               struct task_struct *p = list; > +               list = list->wake_entry; > +               ttwu_do_activate(rq, p, 0); > +       } > + > +       raw_spin_unlock(&rq->lock); > +       irq_exit(); >  } > >  static void ttwu_queue_remote(struct task_struct *p, int cpu) > > > -- Only stand for myself -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/