Received: by 2002:a25:8b91:0:0:0:0:0 with SMTP id j17csp1068683ybl; Fri, 10 Jan 2020 11:30:56 -0800 (PST) X-Google-Smtp-Source: APXvYqxBWQ3wSJBgzRw9KH0w3617jBFXf4iMTbWI/32yUu1iCmB+YAFanDCV34MxDuhVaB5dovmr X-Received: by 2002:a9d:67ce:: with SMTP id c14mr3792679otn.106.1578684656748; Fri, 10 Jan 2020 11:30:56 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1578684656; cv=none; d=google.com; s=arc-20160816; b=DuFpgcfhKt99dxbTZHFl3bC2iRPI2vd/B84NqO7W2zqRz18TcKo583EZBG4JnnMhS4 b9fW6lnM3mczKT9/HA89ivImwRqi71xpqQ+0tU10eTKQttq5KMmGtZybnALe4H2Kelc7 RTvxS1D0WeCcrar7xxWTFD9/DLRXFbq6nEAb1Qwk/vFa8iP0v8j80rfjApPolGE730zn EV76NNyNua6uhyrJHdToo+/TySlfl+GEOB6z9FszFSlcquw8l1ZjUickyFSAqjJp9Zyq /Y0GjqYhzTxuU00k6g3eNY9FmwLwnCqS/ulHA30MneohzzzHPLrtlBye8txcgEHQjJxg v0pA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=cmeeR6sFr6oqqNVkXHK8c9y5ayWUIz10dfFpsgcsQj4=; b=f3f4BnQvDmc/kHvRgMcWTN7QB68l+fNauZL3qHKfxM9kiMcHNZAmuOsjvSqcW4LoIx Ubu/GG+UoXgf50TPYPcH9ZgzQSMZpt3cJ7NAAH5PvTzO1WCtlw58fawlBErgBy1CZMOt gBsqm2iJJuEyM8EtUGG+V7vTmWeqsyIHrNte5NmNZwKVKqDOKIHF7pjQ7QbnLxAvbMzK lejBVJq6bPYEVU1Qx/CcI/P+kaBxXBVp9utGPL3p3+0S2tu+a2PoQ/beQLpMDMHdfme1 qU2Owk+HxEDwbu2g2fCAP4FpJAEOe5HrCk5LSWycSvuhX4BiUFkxDkn6TfYl8kOKDZQz Xfgw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cloudflare.com header.s=google header.b=ptrtSz7f; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=cloudflare.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c2si1853077oto.207.2020.01.10.11.30.45; Fri, 10 Jan 2020 11:30:56 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@cloudflare.com header.s=google header.b=ptrtSz7f; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=cloudflare.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728650AbgAJT2q (ORCPT + 99 others); Fri, 10 Jan 2020 14:28:46 -0500 Received: from mail-qt1-f196.google.com ([209.85.160.196]:39137 "EHLO mail-qt1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728451AbgAJT2p (ORCPT ); Fri, 10 Jan 2020 14:28:45 -0500 Received: by mail-qt1-f196.google.com with SMTP id e5so2933507qtm.6 for ; Fri, 10 Jan 2020 11:28:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=cmeeR6sFr6oqqNVkXHK8c9y5ayWUIz10dfFpsgcsQj4=; b=ptrtSz7foFY8FAz65+d1QyBOqa69MXwzFssgCSf6ATmws/lNJfiY2a9JFpGX2LhLca gCmfPoVT4ezuJMGj3ig7EiO5zns79taKmDXzt+QHbwp2WuUTSgLORwGIUNCS1ERZJEZn vr+VuphxycLfaJCYG5H1yD0qi36MQQbnFJf9M= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=cmeeR6sFr6oqqNVkXHK8c9y5ayWUIz10dfFpsgcsQj4=; b=VzN8MZz+u1Hmlcph+507Lk+Fme2KvODz9H51xDnk7O3q57AE1FzdIpnFBnFG4qSVpk ANsTt5ZrIOVlAXM7gCZWvhy36qz2SUfZJ48BqK0Rujz7QHfyjdjsaO/1XHFBSzpT28u1 5sGw/mOL7huwTT8brrmGJ2brRpg/kCOmroiWfmPTAIBOuhSRpfITmVKLFOaLi9w2yoaB MGudC6BxlkgUcmvGIuy7RoIanZLTXULCj5LwXhDz/WCtJBcIeL1Rh4D2te2U3znGMXH2 cw1wEv4msrnG+Yrdlxs4SWig5MPiKi15DUHL4KWhvlFIQk+ZuR2Jw8Ytmsv0wtTz7YUc Cmpw== X-Gm-Message-State: APjAAAXAl99nBF9ZAhkEGsvUrztz0lESiHWnSZYbS7/PmGYVQ0aFCVYx RrvL/jhvFPQttL6Kwo4VhVuu9vFskLz4Nl4oC0eU7w== X-Received: by 2002:ac8:187b:: with SMTP id n56mr22064qtk.173.1578684524187; Fri, 10 Jan 2020 11:28:44 -0800 (PST) MIME-Version: 1.0 References: <20200109161632.GB8547@cmpxchg.org> In-Reply-To: <20200109161632.GB8547@cmpxchg.org> From: Ivan Babrou Date: Fri, 10 Jan 2020 11:28:32 -0800 Message-ID: Subject: Re: Lower than expected CPU pressure in PSI To: Johannes Weiner Cc: linux-kernel , kernel-team , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I applied the patch on top of 5.5.0-rc3 and it's definitely better now, both competing cgroups report 500ms/s delay. Feel free to add Tested-by from me. I'm still seeing /unified/system.slice at 385ms/s and /unified.slice at 372ms/s, do you have an explanation for that part? Maybe it's totally reasonable, but warrants a patch for documentation. On Thu, Jan 9, 2020 at 8:16 AM Johannes Weiner wrote: > > On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote: > > We added reporting for PSI in cgroups and results are somewhat surprising. > > > > My test setup consists of 3 services: > > > > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1 > > * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1 > > * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1 > > > > First service runs unconstrained, the other two compete for CPU. > > > > As expected, I can see 500ms/s sched delay for the latter two and > > aggregated 1000ms/s delay for /system.slice, no surprises here. > > > > However, CPU pressure reported by PSI says that none of my services > > have any pressure on them. I can see around 434ms/s pressure on > > /unified/system.slice and 425ms/s pressure on /unified cgroup, which > > is surprising for three reasons: > > > > * Pressure is absent for my services (I expect it to match scheed delay) > > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s > > * Pressure on root cgroup is lower than on system.slice > > CPU pressure is currently implemented based only on the number of > *runnable* tasks, not on who gets to actively use the CPU. This works > for contention within cgroups or at the global scope, but it doesn't > correctly reflect competition between cgroups. It also doesn't show > the effects of e.g. cpu cycle limiting through cpu.max where there > might *be* only one runnable task, but it's not getting the CPU. > > I've been working on fixing this, but hadn't gotten around to sending > the patch upstream. Attaching it below. Would you mind testing it? > > Peter, what would you think of the below? > > --- > From 98c233aae05b7d43e465d4c382a3a20905235296 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 14 Nov 2019 15:53:45 -0500 > Subject: [PATCH] psi: fix cpu.pressure for cpu.max and competing cgroups > > For simplicity, cpu pressure is defined as having more than one > runnable task on a given CPU. This works on the system-level, but it > has limitations in a cgrouped reality: When cpu.max is in use, it > doesn't capture the time in which a task is not executing on the CPU > due to throttling. Likewise, it doesn't capture the time in which a > competing cgroup is occupying the CPU - meaning it only reflects > cgroup-internal competitive pressure, not outside pressure. > > Enable tracking of currently executing tasks, and then change the > definition of cpu pressure in a cgroup from > > NR_RUNNING > 1 > > to > > NR_RUNNING > ON_CPU > > which will capture the effects of cpu.max as well as competition from > outside the cgroup. > > After this patch, a cgroup running `stress -c 1` with a cpu.max > setting of 5000 10000 shows ~50% continuous CPU pressure. > > Signed-off-by: Johannes Weiner > --- > include/linux/psi_types.h | 10 +++++++++- > kernel/sched/core.c | 2 ++ > kernel/sched/psi.c | 12 +++++++----- > kernel/sched/stats.h | 28 ++++++++++++++++++++++++++++ > 4 files changed, 46 insertions(+), 6 deletions(-) > > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > index 07aaf9b82241..4b7258495a04 100644 > --- a/include/linux/psi_types.h > +++ b/include/linux/psi_types.h > @@ -14,13 +14,21 @@ enum psi_task_count { > NR_IOWAIT, > NR_MEMSTALL, > NR_RUNNING, > - NR_PSI_TASK_COUNTS = 3, > + /* > + * This can't have values other than 0 or 1 and could be > + * implemented as a bit flag. But for now we still have room > + * in the first cacheline of psi_group_cpu, and this way we > + * don't have to special case any state tracking for it. > + */ > + NR_ONCPU, > + NR_PSI_TASK_COUNTS = 4, > }; > > /* Task state bitmasks */ > #define TSK_IOWAIT (1 << NR_IOWAIT) > #define TSK_MEMSTALL (1 << NR_MEMSTALL) > #define TSK_RUNNING (1 << NR_RUNNING) > +#define TSK_ONCPU (1 << NR_ONCPU) > > /* Resources that workloads could be stalled on */ > enum psi_res { > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 44123b4d14e8..9296e0de7b72 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4072,6 +4072,8 @@ static void __sched notrace __schedule(bool preempt) > */ > ++*switch_count; > > + psi_sched_switch(prev, next, !task_on_rq_queued(prev)); > + > trace_sched_switch(preempt, prev, next); > > /* Also unlocks the rq: */ > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index 517e3719027e..fad91da54aab 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -224,7 +224,7 @@ static bool test_state(unsigned int *tasks, enum psi_states state) > case PSI_MEM_FULL: > return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]; > case PSI_CPU_SOME: > - return tasks[NR_RUNNING] > 1; > + return tasks[NR_RUNNING] > tasks[NR_ONCPU]; > case PSI_NONIDLE: > return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || > tasks[NR_RUNNING]; > @@ -694,10 +694,10 @@ static u32 psi_group_change(struct psi_group *group, int cpu, > if (!(m & (1 << t))) > continue; > if (groupc->tasks[t] == 0 && !psi_bug) { > - printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n", > + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n", > cpu, t, groupc->tasks[0], > groupc->tasks[1], groupc->tasks[2], > - clear, set); > + groupc->tasks[3], clear, set); > psi_bug = 1; > } > groupc->tasks[t]--; > @@ -915,9 +915,11 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) > > rq = task_rq_lock(task, &rf); > > - if (task_on_rq_queued(task)) > + if (task_on_rq_queued(task)) { > task_flags = TSK_RUNNING; > - else if (task->in_iowait) > + if (task_current(rq, task)) > + task_flags |= TSK_ONCPU; > + } else if (task->in_iowait) > task_flags = TSK_IOWAIT; > > if (task->flags & PF_MEMSTALL) > diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h > index ba683fe81a6e..6ff0ac1a803f 100644 > --- a/kernel/sched/stats.h > +++ b/kernel/sched/stats.h > @@ -93,6 +93,14 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep) > if (p->flags & PF_MEMSTALL) > clear |= TSK_MEMSTALL; > } else { > + /* > + * When a task sleeps, schedule() dequeues it before > + * switching to the next one. Merge the clearing of > + * TSK_RUNNING and TSK_ONCPU to save an unnecessary > + * psi_task_change() call in psi_sched_switch(). > + */ > + clear |= TSK_ONCPU; > + > if (p->in_iowait) > set |= TSK_IOWAIT; > } > @@ -126,6 +134,23 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) > } > } > > +static inline void psi_sched_switch(struct task_struct *prev, > + struct task_struct *next, > + bool sleep) > +{ > + if (static_branch_likely(&psi_disabled)) > + return; > + > + /* > + * Clear the TSK_ONCPU state if the task was preempted. If > + * it's a voluntary sleep, dequeue will have taken care of it. > + */ > + if (!sleep) > + psi_task_change(prev, TSK_ONCPU, 0); > + > + psi_task_change(next, 0, TSK_ONCPU); > +} > + > static inline void psi_task_tick(struct rq *rq) > { > if (static_branch_likely(&psi_disabled)) > @@ -138,6 +163,9 @@ static inline void psi_task_tick(struct rq *rq) > static inline void psi_enqueue(struct task_struct *p, bool wakeup) {} > static inline void psi_dequeue(struct task_struct *p, bool sleep) {} > static inline void psi_ttwu_dequeue(struct task_struct *p) {} > +static inline void psi_sched_switch(struct task_struct *prev, > + struct task_struct *next, > + bool sleep) {} > static inline void psi_task_tick(struct rq *rq) {} > #endif /* CONFIG_PSI */ > > -- > 2.24.1