From: Suren Baghdasaryan
Date: Fri, 13 Jul 2018 15:13:07 -0700
Subject: Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
To: Johannes Weiner
Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds, Tejun Heo, Vinayak Menon, Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
In-Reply-To: <20180712172942.10094-11-hannes@cmpxchg.org>
References: <20180712172942.10094-1-hannes@cmpxchg.org> <20180712172942.10094-11-hannes@cmpxchg.org>

On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner wrote:
> Right now, psi reports pressure and stall times of already concluded
> stall events.
> For most use cases this is current enough, but certain
> highly latency-sensitive applications, like the Android OOM killer,

To be more precise, it's the Android LMKD (low memory killer daemon),
not to be confused with the kernel OOM killer.

> might want to know about and react to stall states before they have
> even concluded (e.g. a prolonged reclaim cycle).
>
> This patches the procfs/cgroupfs interface such that when the pressure
> metrics are read, the current per-cpu states, if any, are taken into
> account as well.
>
> Any ongoing states are concluded, their time snapshotted, and then
> restarted. This requires holding the rq lock to avoid corruption. It
> could use some form of rq lock ratelimiting or avoidance.
>
> Requested-by: Suren Baghdasaryan
> Not-yet-signed-off-by: Johannes Weiner
> ---

IMHO this description is a little difficult to understand. In essence,
PSI information is updated periodically every 2 seconds, and without
this patch the data can be stale at the time we read it (because it was
last updated up to 2 seconds ago). To avoid this, the patch updates the
PSI "total" values at the moment the data is read. I've put two small
sketches after the patch below to illustrate how I read the
conclude/restart step and what it buys a userspace consumer.

> kernel/sched/psi.c | 56 +++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 53e0b7b83e2e..5a6c6057f775 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -190,7 +190,7 @@ static void calc_avgs(unsigned long avg[3], u64 time, int missed_periods)
>  	}
>  }
>
> -static bool psi_update_stats(struct psi_group *group)
> +static bool psi_update_stats(struct psi_group *group, bool ondemand)
>  {
>  	u64 some[NR_PSI_RESOURCES] = { 0, };
>  	u64 full[NR_PSI_RESOURCES] = { 0, };
> @@ -200,8 +200,6 @@ static bool psi_update_stats(struct psi_group *group)
>  	int cpu;
>  	int r;
>
> -	mutex_lock(&group->stat_lock);
> -
>  	/*
>  	 * Collect the per-cpu time buckets and average them into a
>  	 * single time sample that is normalized to wallclock time.
> @@ -218,10 +216,36 @@
>  	for_each_online_cpu(cpu) {
>  		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
>  		unsigned long nonidle;
> +		struct rq_flags rf;
> +		struct rq *rq;
> +		u64 now;
>
> -		if (!groupc->nonidle_time)
> +		if (!groupc->nonidle_time && !groupc->nonidle)
>  			continue;
>
> +		/*
> +		 * We come here for two things: 1) periodic per-cpu
> +		 * bucket flushing and averaging and 2) when the user
> +		 * wants to read a pressure file. For flushing and
> +		 * averaging, which is relatively infrequent, we can
> +		 * be lazy and tolerate some raciness with concurrent
> +		 * updates to the per-cpu counters. However, if a user
> +		 * polls the pressure state, we want to give them the
> +		 * most uptodate information we have, including any
> +		 * currently active state which hasn't been timed yet,
> +		 * because in case of an iowait or a reclaim run, that
> +		 * can be significant.
> +		 */
> +		if (ondemand) {
> +			rq = cpu_rq(cpu);
> +			rq_lock_irq(rq, &rf);
> +
> +			now = cpu_clock(cpu);
> +
> +			groupc->nonidle_time += now - groupc->nonidle_start;
> +			groupc->nonidle_start = now;
> +		}
> +
>  		nonidle = nsecs_to_jiffies(groupc->nonidle_time);
>  		groupc->nonidle_time = 0;
>  		nonidle_total += nonidle;
> @@ -229,13 +253,27 @@ static bool psi_update_stats(struct psi_group *group)
>  		for (r = 0; r < NR_PSI_RESOURCES; r++) {
>  			struct psi_resource *res = &groupc->res[r];
>
> +			if (ondemand && res->state != PSI_NONE) {
> +				bool is_full = res->state == PSI_FULL;
> +
> +				res->times[is_full] += now - res->state_start;
> +				res->state_start = now;
> +			}
> +
>  			some[r] += (res->times[0] + res->times[1]) * nonidle;
>  			full[r] += res->times[1] * nonidle;
>
> -			/* It's racy, but we can tolerate some error */
>  			res->times[0] = 0;
>  			res->times[1] = 0;
>  		}
> +
> +		if (ondemand)
> +			rq_unlock_irq(rq, &rf);
> +	}
> +
> +	for (r = 0; r < NR_PSI_RESOURCES; r++) {
> +		do_div(some[r], max(nonidle_total, 1UL));
> +		do_div(full[r], max(nonidle_total, 1UL));
>  	}
>
>  	/*
> @@ -249,12 +287,10 @@
>  	 * activity, thus no data, and clock ticks are sporadic. The
>  	 * below handles both.
>  	 */
> +	mutex_lock(&group->stat_lock);
>
>  	/* total= */
>  	for (r = 0; r < NR_PSI_RESOURCES; r++) {
> -		do_div(some[r], max(nonidle_total, 1UL));
> -		do_div(full[r], max(nonidle_total, 1UL));
> -
>  		group->some[r] += some[r];
>  		group->full[r] += full[r];
>  	}
> @@ -301,7 +337,7 @@ static void psi_clock(struct work_struct *work)
>  	 * go - see calc_avgs() and missed_periods.
>  	 */
>
> -	nonidle = psi_update_stats(group);
> +	nonidle = psi_update_stats(group, false);
>
>  	if (nonidle) {
>  		unsigned long delay = 0;
> @@ -570,7 +606,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>  	if (psi_disabled)
>  		return -EOPNOTSUPP;
>
> -	psi_update_stats(group);
> +	psi_update_stats(group, true);
>
>  	for (w = 0; w < 3; w++) {
>  		avg[0][w] = group->avg_some[res][w];
> --
> 2.18.0
>
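To make the conclude/snapshot/restart step from the changelog easier to
follow, here is how I read it in isolation. The struct and function names
below are illustrative only, not the kernel's psi types, and this skips
the rq locking and per-cpu details entirely:

	/* Illustrative types only -- not the kernel's psi structures. */
	struct state_timer {
		unsigned long long state_start;	/* when the current state began */
		unsigned long long accounted;	/* stall time concluded so far */
		int active;			/* is a stall ongoing right now? */
	};

	/*
	 * Conclude any ongoing state at 'now', fold the elapsed time
	 * into the accounted total, and restart the clock. A reader
	 * then sees stall time up to 'now' instead of up to the last
	 * periodic flush; since the state is still ongoing, future
	 * time keeps accruing from 'now'.
	 */
	static unsigned long long snapshot(struct state_timer *t,
					   unsigned long long now)
	{
		if (t->active) {
			t->accounted += now - t->state_start;
			t->state_start = now;
		}
		return t->accounted;
	}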
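And for completeness, this is roughly what a consumer like LMKD gets out
of it: a poller whose per-interval delta includes any still-ongoing
stall instead of data up to 2 seconds old. A minimal userspace sketch;
the /proc/pressure/memory path and the "some ... total=<cumulative>"
line format are my assumptions from the rest of this series, the final
interface may differ:

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char line[256];
		unsigned long long prev = 0;

		for (;;) {
			FILE *f = fopen("/proc/pressure/memory", "r");

			if (!f)
				return 1;

			while (fgets(line, sizeof(line), f)) {
				unsigned long long total;

				/* Watch the "some" line's cumulative stall time. */
				if (sscanf(line, "some %*s %*s %*s total=%llu",
					   &total) == 1) {
					/*
					 * With this patch, the delta reflects
					 * stall time up to the moment of the
					 * read, so a monitor can react before
					 * the periodic averaging pass runs.
					 */
					printf("stall delta: %llu\n", total - prev);
					prev = total;
				}
			}
			fclose(f);
			sleep(1);
		}
	}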