Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755728AbbDGPxq (ORCPT ); Tue, 7 Apr 2015 11:53:46 -0400 Received: from forward-corp1g.mail.yandex.net ([95.108.253.251]:45475 "EHLO forward-corp1g.mail.yandex.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753278AbbDGPxl (ORCPT ); Tue, 7 Apr 2015 11:53:41 -0400 Authentication-Results: smtpcorp1m.mail.yandex.net; dkim=pass header.i=@yandex-team.ru Message-ID: <5523FD82.30006@yandex-team.ru> Date: Tue, 07 Apr 2015 18:53:38 +0300 From: Konstantin Khlebnikov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: bsegall@google.com CC: Peter Zijlstra , Ingo Molnar , linux-kernel@vger.kernel.org, Roman Gushchin , pjt@google.com Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task() References: <20150403124138.1349.11633.stgit@buzz> In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4563 Lines: 119 On 07.04.2015 01:45, bsegall@google.com wrote: > Konstantin Khlebnikov writes: > >> Pick_next_task_fair() must be sure that here is at least one runnable >> task before calling put_prev_task(), but put_prev_task() can expire >> last remains of cfs quota and throttle all currently runnable tasks. >> As a result pick_next_task_fair() cannot find next task and crashes. >> >> This patch leaves 1 in ->runtime_remaining when current assignation >> expires and tries to refill it right after that. In the worst case >> task will be scheduled once and throttled at the end of slice. >> > > I don't think expire_cfs_rq_runtime is the problem. What I believe > happens is this: > > /prev/some_task is running, calls schedule() with nr_running == 2. > pick_next's first do/while loop does update_curr(/) and picks /next, and > the next iteration just sees check_cfs_rq_runtime(/next), and thus does > goto simple. However, there is now only /prev/some_task runnable, and it > hasn't checked the entire prev hierarchy for throttling, thus leading to > the crash. > > This would require that check_cfs_rq_runtime(/next) return true despite > being on_rq though, which iirc is not supposed to happen (note that we > do not call update_curr(/next), and it would do nothing if we did, > because /next isn't part of the current thread's hierarchy). However, > this /can/ happen if runtime has just been (re)enabled on /next, because > tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1. Yeah, this seems possible too. > The idea was that each rq would grab runtime when they were scheduled > (pick_next_task_fair didn't ever look at throttling info), so this was > fine with the old code, but is a problem now. I think it would be > sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably > more precise option would be to only check_cfs_rq_runtime if > cfs_rq->curr is set, but the code is slightly less pretty. Probably this code should call something like account_cfs_rq_runtime(cfs_rq, 0); > > Could you check this patch to see if it works (or the trivial > tg_set_bandwidth runtime_remaining = 1 patch)? I'm not sure that I'll see this bug again: we're replacing this setup with something different. > > ---8<----->8--- > > From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001 > From: Ben Segall > Date: Mon, 6 Apr 2015 15:28:10 -0700 > Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair > > The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed > to trigger when cfs_rq is still an ancestor of prev. However, it was able to > trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth > set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global > pool. > > Fix this by only calling check_cfs_rq_runtime if we are still in prev's > ancestry, as evidenced by cfs_rq->curr. > > Reported-by: Konstantin Khlebnikov > Signed-off-by: Ben Segall > --- > kernel/sched/fair.c | 25 ++++++++++++++----------- > 1 file changed, 14 insertions(+), 11 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ee595ef..5cb52e9 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5038,18 +5038,21 @@ again: > * entity, update_curr() will update its vruntime, otherwise > * forget we've ever seen it. > */ > - if (curr && curr->on_rq) > - update_curr(cfs_rq); > - else > - curr = NULL; > + if (curr) { > + if (curr->on_rq) > + update_curr(cfs_rq); > + else > + curr = NULL; > > - /* > - * This call to check_cfs_rq_runtime() will do the throttle and > - * dequeue its entity in the parent(s). Therefore the 'simple' > - * nr_running test will indeed be correct. > - */ > - if (unlikely(check_cfs_rq_runtime(cfs_rq))) > - goto simple; > + /* > + * This call to check_cfs_rq_runtime() will do the > + * throttle and dequeue its entity in the parent(s). > + * Therefore the 'simple' nr_running test will indeed > + * be correct. > + */ > + if (unlikely(check_cfs_rq_runtime(cfs_rq))) > + goto simple; > + } > > se = pick_next_entity(cfs_rq, curr); > cfs_rq = group_cfs_rq(se); > -- Konstantin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/