Received: by 2002:ac0:a5a6:0:0:0:0:0 with SMTP id m35-v6csp1857999imm; Thu, 23 Aug 2018 09:53:57 -0700 (PDT) X-Google-Smtp-Source: AA+uWPxWiSF4ZJEq84ym6eSpBl+cgC5gdvq3l9thJuOgWRdnzjFkCzdPxKGx/ovzi9XfGomAziN4 X-Received: by 2002:a63:f:: with SMTP id 15-v6mr56963736pga.430.1535043237668; Thu, 23 Aug 2018 09:53:57 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1535043237; cv=none; d=google.com; s=arc-20160816; b=lDE+ejaAp5WQtYc5uhD2IJkeLHEsBXX4MasHl/KrqgiNCRYLi1Unsacjqkbvz3hIHr adJLHYxiiPUZWTFoknsQb/w2mIyb4ehgNjbU/oVb/1xWHcFot2LJuy8QeoGJQmDCw20A je06Hpkyl+l4h0mWpE+zjuz3h1a6gd/wkzQYvL+mtKh03tnayQfY31/yVhqk0G4UTQrc Slmu7qCvYVE5flyG2yvC//G90eToJ+vXpf7C1d/kuyWM4gMsbKjuHWW29TNMLcwkBHUz MhqjTdbrsNy3uyd7wujmRL2ZYLe4kNqDSqbymg53tX2jz75EQ3F1lqgTznxXAvjZQnQ2 WgeA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:arc-authentication-results; bh=lldI3GfwWBMbxhl6MGYaAMX2XmfSiVUSxLiIxbyXW1k=; b=J/owiz0MbicXfXcomD/E/ZYr52OXBfNNPQqVD82t7tNhPLnSx986YEZLKDPgOmwjAF VQC+9t70D9zsp+H1QdR9MGR1XGkQfLoU0ovaBL80dLAwtA9JnTItMbkNNgp8Xjy5rHWC esA2kth6/FbcgDvXGS6UuFXgza+Ac6cWFGgzeJftC/kZx1zFfE/mlWQbhb3r+VqSuZgk AhIrRiZOIY3rj36rqZp2Fw3eTqzKzy6JVkNWMUv03Ce8KPhGIMFGCmNJrobPdnzG7AtL LfYsQisBKgXoLw1MKl6h+LnKUrddhdyuSwKzx+IDA7KreVaUa4sFQrr26yDfZ39H7r/J GVtQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id bc3-v6si4486255plb.214.2018.08.23.09.53.41; Thu, 23 Aug 2018 09:53:57 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726362AbeHWUWt (ORCPT + 99 others); Thu, 23 Aug 2018 16:22:49 -0400 Received: from foss.arm.com ([217.140.101.70]:49154 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726148AbeHWUWt (ORCPT ); Thu, 23 Aug 2018 16:22:49 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 1CFB57A9; Thu, 23 Aug 2018 09:52:13 -0700 (PDT) Received: from [0.0.0.0] (e107985-lin.emea.arm.com [10.4.12.239]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 5F5843F2EA; Thu, 23 Aug 2018 09:52:10 -0700 (PDT) Subject: Re: [PATCH] sched/fair: vruntime should normalize when switching from fair To: Miguel de Dios , Steve Muckle , Peter Zijlstra , Ingo Molnar Cc: linux-kernel@vger.kernel.org, kernel-team@android.com, Todd Kjos , Paul Turner , Quentin Perret , Patrick Bellasi , Chris Redpath , Morten Rasmussen , John Dias References: <20180817182728.76129-1-smuckle@google.com> From: Dietmar Eggemann Message-ID: Date: Thu, 23 Aug 2018 18:52:08 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-GB Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On 08/21/2018 01:54 AM, Miguel de Dios wrote: > On 08/17/2018 11:27 AM, Steve Muckle wrote: >> From: John Dias >> >> When rt_mutex_setprio changes a task's scheduling class to RT, >> we're seeing cases where the task's vruntime is not updated >> correctly upon return to the fair class. >> Specifically, the following is being observed: >> - task is deactivated while still in the fair class >> - task is boosted to RT via rt_mutex_setprio, which changes >>    the task to RT and calls check_class_changed. >> - check_class_changed leads to detach_task_cfs_rq, at which point >>    the vruntime_normalized check sees that the task's state is >> TASK_WAKING, >>    which results in skipping the subtraction of the rq's min_vruntime >>    from the task's vruntime >> - later, when the prio is deboosted and the task is moved back >>    to the fair class, the fair rq's min_vruntime is added to >>    the task's vruntime, even though it wasn't subtracted earlier. >> The immediate result is inflation of the task's vruntime, giving >> it lower priority (starving it if there's enough available work). >> The longer-term effect is inflation of all vruntimes because the >> task's vruntime becomes the rq's min_vruntime when the higher >> priority tasks go idle. That leads to a vicious cycle, where >> the vruntime inflation repeatedly doubled. >> >> The change here is to detect when vruntime_normalized is being >> called when the task is waking but is waking in another class, >> and to conclude that this is a case where vruntime has not >> been normalized. >> >> Signed-off-by: John Dias >> Signed-off-by: Steve Muckle >> --- >>   kernel/sched/fair.c | 3 ++- >>   1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index b39fb596f6c1..14011d7929d8 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct >> task_struct *p) >>        * - A task which has been woken up by try_to_wake_up() and >>        *   waiting for actually being woken up by sched_ttwu_pending(). >>        */ >> -    if (!se->sum_exec_runtime || p->state == TASK_WAKING) >> +    if (!se->sum_exec_runtime || >> +        (p->state == TASK_WAKING && p->sched_class == >> &fair_sched_class)) >>           return true; >>       return false; > The normalization of vruntime used to exist in task_waking but it was > removed and the normalization was moved into migrate_task_rq_fair. The > reasoning being that task_waking_fair was only hit when a task is queued > onto a different core and migrate_task_rq_fair should do the same work. > > However, we're finding that there's one case which migrate_task_rq_fair > doesn't hit: that being the case where rt_mutex_setprio changes a task's > scheduling class to RT when its scheduled out. The task never hits > migrate_task_rq_fair because it is switched to RT and migrates as an RT > task. Because of this we're getting an unbounded addition of > min_vruntime when the task is re-attached to the CFS runqueue when it > loses the inherited priority. The patch above works because now the > kernel specifically checks for this case and normalizes accordingly. > > Here's the patch I was talking about: > https://lore.kernel.org/patchwork/patch/677689/. In our testing we were > seeing vruntimes nearly double every time after rt_mutex_setprio boosts > the task to RT. > > Signed-off-by: Miguel de Dios > Tested-by: Miguel de Dios I tried to catch this issue on my Arm64 Juno board using pi_test (and a slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from rt-tests but wasn't able to do so. # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs Starting PI Stress Test Number of thread groups: 1 Duration of test run: 1 seconds Number of inversions per group: 1 Admin thread SCHED_FIFO priority 4 1 groups of 3 threads will be created High thread SCHED_FIFO priority 3 Med thread SCHED_FIFO priority 2 Low thread SCHED_OTHER nice 0 # ./pip_stress In both cases, the cfs task entering rt_mutex_setprio() is queued, so dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime from se->vruntime, is called on it before it gets the rt prio. Maybe it requires a very specific use of the pthread library to provoke this issue by making sure that the cfs tasks really blocks/sleeps?