From: Balbir Singh
Date: Tue, 24 Jul 2018 07:14:02 +1000
Subject: Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
To: Johannes Weiner
Cc: Ingo Molnar, Peter Zijlstra, "akpm@linux-foundation.org",
    Linus Torvalds, Tejun Heo, surenb@google.com, Vinayak Menon,
    Christoph Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
    cgroups@vger.kernel.org, "linux-kernel@vger.kernel.org",
    kernel-team@fb.com
In-Reply-To: <20180712172942.10094-1-hannes@cmpxchg.org>

On Fri, Jul 13, 2018 at 3:27 AM Johannes Weiner wrote:
>
> PSI aggregates and reports the overall wallclock time in which the
> tasks in a system (or cgroup) wait for contended hardware resources.
>
> This helps users understand the resource pressure their workloads are
> under, which allows them to root-cause and fix throughput and latency
> problems caused by overcommitting, underprovisioning, or suboptimal
> job placement in a grid, and to anticipate major disruptions like OOM.
>
> This version 2 of the series incorporates a ton of feedback from
> PeterZ and SurenB; more details at the end of this email.
>
> Real-world applications
>
> We're using the data collected by psi (and its previous incarnation,
> memdelay) quite extensively at Facebook, with several success stories.
>
> One usecase is avoiding OOM hangs/livelocks. These happen because the
> OOM killer is triggered by reclaim failing to free pages, but with
> fast flash devices there is *always* some clean and uptodate cache to
> reclaim; the OOM killer never kicks in, even as tasks spend 90% of
> their time thrashing the cache pages of their own executables. There
> is no situation where this ever makes sense in practice. We wrote a
> <100 line POC python script to monitor memory pressure and kill stuff
> way before such pathological thrashing leads to full system losses
> that require forcible hard resets.
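For anyone who wants to experiment with the idea, a toy watchdog along
those lines could look like the sketch below. To be clear, this is not
the script mentioned above; the 10% full-pressure threshold, the
5-second poll interval, and the pick-the-biggest-RSS victim policy are
all illustrative assumptions:

#!/usr/bin/env python3
# Toy memory-pressure watchdog: kill the largest process when "full"
# memory pressure stays high. Illustrative sketch only, not the POC
# script referenced in the cover letter.
import os
import signal
import time

THRESHOLD = 10.0   # percent "full" stall over the 10s window (assumed)
INTERVAL = 5       # seconds between polls (assumed)

def full_avg10(path="/proc/pressure/memory"):
    """Return avg10 from the 'full' line, e.g.
    'full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258'."""
    with open(path) as f:
        for line in f:
            if line.startswith("full"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

def biggest_rss_pid():
    """Find the pid with the largest resident set via /proc/<pid>/statm
    (second field is RSS in pages)."""
    best_pid, best_rss = None, -1
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/statm") as f:
                rss = int(f.read().split()[1])
        except (OSError, IndexError, ValueError):
            continue  # process exited or is unreadable
        if rss > best_rss:
            best_pid, best_rss = int(pid), rss
    return best_pid

while True:
    if full_avg10() > THRESHOLD:
        victim = biggest_rss_pid()
        if victim and victim != os.getpid():
            os.kill(victim, signal.SIGKILL)
    time.sleep(INTERVAL)

Anything real would of course need rate limiting and a smarter victim
policy, but it shows how little userspace code the interface requires.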
> We've since extended and deployed this code to other places to
> guarantee latency and throughput SLAs, since they're usually violated
> way before the kernel OOM killer would ever kick in.
>
> The idea is to eventually incorporate this back into the kernel, so
> that Linux can avoid OOM livelocks (which technically aren't memory
> deadlocks, but are indistinguishable from them for the user) out of
> the box.
>
> We also use psi memory pressure for loadshedding. Our batch job
> infrastructure used to use heuristics based on various VM stats to
> anticipate OOM situations, with lackluster success. We switched it to
> psi and managed to anticipate and avoid OOM kills and hangs fairly
> reliably. The reduction of OOM outages in the worker pool raised the
> pool's aggregate productivity, and we were able to switch that
> service to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from
> maintenance crap like package upgrades, logging, and configuration,
> as well as to prevent multiple workloads on a machine from stepping
> on each other's toes. We were not able to configure this properly
> without the pressure metrics; we would see latency or bandwidth
> drops, but it was often hard to impossible to root-cause them
> post-mortem.
>
> We now log and graph pressure for the containers in our fleet and can
> trivially link latency spikes and throughput drops to shortages of
> specific resources after the fact, and fix the job config/scheduling.
>
> I've also received feedback and feature requests from Android for the
> purpose of low-latency OOM killing. The on-demand stats aggregation
> in the last patch of this series is for this purpose: to allow
> Android to react to pressure before the system starts visibly
> hanging.
>
> How do you use this feature?
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory
> with 3 files: cpu, memory, and io. If using cgroup2, cgroups will
> also have cpu.pressure, memory.pressure and io.pressure files, which
> simply aggregate task stalls at the cgroup level instead of
> system-wide.
>
> The cpu file contains one line:
>
>         some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
>
> The averages give the percentage of walltime in which one or more
> tasks are delayed on the runqueue while another task has the CPU.
> They're recent averages over 10s, 1m, 5m windows, so you can tell
> short-term trends from long-term ones, similarly to the load average.

Does the mechanism scale? I am a little concerned about how frequently
this infrastructure is monitored/read/acted upon. Why aren't existing
mechanisms sufficient -- why is the avg delay calculation in the
kernel?

> The total= value gives the absolute stall time in microseconds. This
> allows detecting latency spikes that might be too short to sway the
> running averages. It also allows custom time averaging in case the
> 10s/1m/5m windows aren't adequate for the usecase (or are too coarse
> with future hardware).
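To make the custom-averaging point concrete, a minimal sketch of
deriving a stall percentage from total= deltas might look like this;
the 1-second window and the choice of /proc/pressure/cpu are arbitrary
choices for illustration:

#!/usr/bin/env python3
# Derive a custom-window stall percentage from the monotonically
# increasing total= counter (microseconds) in a pressure file.
import time

def read_total(path="/proc/pressure/cpu", line_type="some"):
    """Extract total= (usec stalled) from the requested line."""
    with open(path) as f:
        for line in f:
            if line.startswith(line_type):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return int(fields["total"])
    raise ValueError(f"no '{line_type}' line in {path}")

def stall_percent(window=1.0):
    """Sample total= twice, `window` seconds apart, and convert the
    delta to a percentage of the elapsed wallclock time."""
    before = read_total()
    t0 = time.monotonic()
    time.sleep(window)
    delta_us = read_total() - before
    elapsed_us = (time.monotonic() - t0) * 1e6
    return 100.0 * delta_us / elapsed_us

print(f"some cpu stall over 1s: {stall_percent():.2f}%")

Sampling total= over a short window like this also catches the brief
spikes that the built-in 10s average would smooth over.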
> What to make of this "some" metric? If CPU utilization is at 100% and
> CPU pressure is 0, the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> of course a great state: no CPU cycles are being wasted, even if 50%
> of the threads were to go idle (as most workloads do vary). From the
> perspective of the individual job it's not great, however, and it
> would do better with more resources. Depending on what your
> priorities and options are, raised "some" numbers may or may not
> require action.
>
> The memory file contains two lines:
>
>         some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
>         full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
>
> The some line is the same as for cpu: the time in which at least one
> task is stalled on the resource. In the case of memory, this includes
> waiting on swap-in, page cache refaults, and page reclaim.
>
> The full line, however, indicates time in which *nobody* is using the
> CPU productively due to pressure: all non-idle tasks are waiting for
> memory in one form or another. Significant time spent in there is a
> good trigger for killing things, moving jobs to other machines, or
> dropping incoming requests, since neither the jobs nor the machine
> overall are making much headway.
>
> The io file is similar to memory. Because the block layer doesn't
> have a concept of hardware contention right now (how much longer is
> my IO request taking due to other tasks?), it reports CPU potential
> lost on all IO delays, not just the potential lost due to
> competition.

There is no talk about the overhead this introduces in general; maybe
the details are in the patches. I'll read through them.

Balbir Singh