From: Vincent Guittot
Date: Tue, 29 May 2018 15:29:57 +0200
Subject: Re: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking
To: Patrick Bellasi
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, "Rafael J. Wysocki", Juri Lelli,
 Dietmar Eggemann, Morten Rasmussen, Viresh Kumar, Valentin Schneider,
 Quentin Perret
In-Reply-To: <20180525155437.GE30654@e110439-lin>
References: <1527253951-22709-1-git-send-email-vincent.guittot@linaro.org>
 <1527253951-22709-3-git-send-email-vincent.guittot@linaro.org>
 <20180525155437.GE30654@e110439-lin>

Hi Patrick,

On 25 May 2018 at 17:54, Patrick Bellasi wrote:
> On 25-May 15:12, Vincent Guittot wrote:
>> schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs
>                                                                     ^
>                                                                     only
> otherwise, while RT tasks are running we go to max.
>
>> tasks are running.
>> When the CPU is overloaded by cfs and rt tasks, cfs tasks
>                   ^^^^^^^^^^
> I would say we always have the problem whenever an RT task preempts a
> CFS one, even just for a few ms, isn't it?

The problem only happens when there is not enough time to run all tasks
(rt and cfs). If the cfs task is only preempted for a few ms, the main
impact is a delay in its execution; as long as there is still enough time
to do the cfs work (the cpu goes back to idle from time to time), there is
no "real" problem.
For now, this means it is not a problem as long as the rt task doesn't take
more than the margin that schedutil uses to select a frequency:
(max_freq + (max_freq >> 2)) * util / max_capacity
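
To make that margin concrete, here is a rough sketch of the selection
arithmetic (illustrative only, not the actual cpufreq_schedutil.c code;
the helper name is made up):

/*
 * schedutil requests roughly 1.25 * max_freq * util / max_capacity,
 * i.e. it keeps ~25% headroom above the tracked utilization.
 */
static unsigned long sketch_next_freq(unsigned long max_freq,
				      unsigned long util,
				      unsigned long max_capacity)
{
	/* (max_freq + max_freq/4) * util / max_capacity */
	return (max_freq + (max_freq >> 2)) * util / max_capacity;
}

In other words, the rt pressure only becomes visible as a problem once it
eats more than that headroom.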
>
>> are preempted by rt tasks and in this case util_avg reflects the remaining
>> capacity but not what cfs want to use. In such case, schedutil can select a
>> lower OPP whereas the CPU is overloaded. In order to have a more accurate
>> view of the utilization of the CPU, we track the utilization that is
>> "stolen" by rt tasks.
>>
>> rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
>> the same at the root group level, so the PELT windows of the util_sum are
>> aligned.
>>
>> Signed-off-by: Vincent Guittot
>> ---
>>  kernel/sched/fair.c  | 15 ++++++++++++++-
>>  kernel/sched/pelt.c  | 23 +++++++++++++++++++++++
>>  kernel/sched/pelt.h  |  7 +++++++
>>  kernel/sched/rt.c    |  8 ++++++++
>>  kernel/sched/sched.h |  7 +++++++
>>  5 files changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6390c66..fb18bcc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7290,6 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>>  	return false;
>>  }
>>
>> +static inline bool rt_rq_has_blocked(struct rq *rq)
>> +{
>> +	if (rq->avg_rt.util_avg)
>
> Should use READ_ONCE?

I was expecting there would be only one read by default, but I can add
READ_ONCE.

>
>> +		return true;
>> +
>> +	return false;
>
> What about just:
>
>    return READ_ONCE(rq->avg_rt.util_avg);
>
> ?

This function is renamed and extended to track other signals in the
following patches, so we have to test several values in the function.
That's also why the if test is there: additional checks are going to be
added.
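
For reference, a purely hypothetical sketch of where this helper is
heading (the name and the extra signal below are assumptions about the
later patches in the series, not part of this patch):

static inline bool others_have_blocked(struct rq *rq)
{
	if (READ_ONCE(rq->avg_rt.util_avg))
		return true;

	/* later patches would add further rq-level signals here, e.g.: */
	if (READ_ONCE(rq->avg_dl.util_avg))
		return true;

	return false;
}

which is why a bare "return READ_ONCE(...);" would not stay that simple.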
>
>> +}
>> +
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>
>>  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>> @@ -7349,6 +7357,10 @@ static void update_blocked_averages(int cpu)
>>  		if (cfs_rq_has_blocked(cfs_rq))
>>  			done = false;
>>  	}
>> +	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>> +	/* Don't need periodic decay once load/util_avg are null */
>> +	if (rt_rq_has_blocked(rq))
>> +		done = false;
>>
>>  #ifdef CONFIG_NO_HZ_COMMON
>>  	rq->last_blocked_load_update_tick = jiffies;
>> @@ -7414,9 +7426,10 @@ static inline void update_blocked_averages(int cpu)
>>  	rq_lock_irqsave(rq, &rf);
>>  	update_rq_clock(rq);
>>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>> +	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>>  #ifdef CONFIG_NO_HZ_COMMON
>>  	rq->last_blocked_load_update_tick = jiffies;
>> -	if (!cfs_rq_has_blocked(cfs_rq))
>> +	if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
>>  		rq->has_blocked_load = 0;
>>  #endif
>>  	rq_unlock_irqrestore(rq, &rf);
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index e6ecbb2..213b922 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -309,3 +309,26 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>>
>>  	return 0;
>>  }
>> +
>> +/*
>> + * rt_rq:
>> + *
>> + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
>> + *   util_sum = cpu_scale * load_sum
>> + *   runnable_load_sum = load_sum
>> + *
>> + */
>> +
>> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
>> +{
>> +	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
>> +				running,
>> +				running,
>> +				running)) {
>> +
>
> Not needed empty line?

Yes, probably. This empty line comes from the copy/paste of
__update_load_avg_cfs_rq(). I will consolidate this in the next version.

>
>> +		___update_load_avg(&rq->avg_rt, 1, 1);
>> +		return 1;
>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
>> index 9cac73e..b2983b7 100644
>> --- a/kernel/sched/pelt.h
>> +++ b/kernel/sched/pelt.h
>> @@ -3,6 +3,7 @@
>>  int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
>>  int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
>>  int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
>> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>>
>>  /*
>>   * When a task is dequeued, its estimated utilization should not be update if
>> @@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>>  	return 0;
>>  }
>>
>> +static inline int
>> +update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
>> +{
>> +	return 0;
>> +}
>> +
>>  #endif
>>
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef3c4e6..b4148a9 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -5,6 +5,8 @@
>>   */
>>  #include "sched.h"
>>
>> +#include "pelt.h"
>> +
>>  int sched_rr_timeslice = RR_TIMESLICE;
>>  int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
>>
>> @@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>>  	rt_queue_push_tasks(rq);
>>
>> +	update_rt_rq_load_avg(rq_clock_task(rq), rq,
>> +		rq->curr->sched_class == &rt_sched_class);
>> +
>>  	return p;
>>  }
>>
>> @@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>>  {
>>  	update_curr_rt(rq);
>>
>> +	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +
>>  	/*
>>  	 * The previous task needs to be made eligible for pushing
>>  	 * if it is still active
>> @@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
>>  	struct sched_rt_entity *rt_se = &p->rt;
>>
>>  	update_curr_rt(rq);
>> +	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>
> Mmm... not entirely sure... can't we fold
> update_rt_rq_load_avg() into update_curr_rt() ?
>
> Currently update_curr_rt() is used in:
>    dequeue_task_rt
>    pick_next_task_rt
>    put_prev_task_rt
>    task_tick_rt
>
> while we update_rt_rq_load_avg() only in:
>    pick_next_task_rt
>    put_prev_task_rt
>    task_tick_rt
> and
>    update_blocked_averages
>
> Why don't we need to update at dequeue_task_rt() time?

We are tracking the rt rq and not the sched entities, so we only need to
know when the rt sched class starts or stops being the running class on
the rq. Tracking dequeue_task_rt is useless.

>
>>
>>  	watchdog(rq, p);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 757a3ee..7a16de9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -592,6 +592,7 @@ struct rt_rq {
>>  	unsigned long		rt_nr_total;
>>  	int			overloaded;
>>  	struct plist_head	pushable_tasks;
>> +
>>  #endif /* CONFIG_SMP */
>>  	int			rt_queued;
>>
>> @@ -847,6 +848,7 @@ struct rq {
>>
>>  	u64			rt_avg;
>>  	u64			age_stamp;
>> +	struct sched_avg	avg_rt;
>>  	u64			idle_stamp;
>>  	u64			avg_idle;
>>
>> @@ -2205,4 +2207,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
>>
>>  	return util;
>>  }
>> +
>> +static inline unsigned long cpu_util_rt(struct rq *rq)
>> +{
>> +	return rq->avg_rt.util_avg;
>
> READ_ONCE?
>
>> +}
>>  #endif
>> --
>> 2.7.4
>>
>
> --
> #include
>
> Patrick Bellasi