From: Johannes Weiner
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org, cgroups@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang, Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker, Vinayak Menon, Ruslan Ruslichenko, kernel-team@fb.com
Subject: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO
Date: Mon, 7 May 2018 17:01:28 -0400
Message-Id: <20180507210135.1823-1-hannes@cmpxchg.org>

Hi,

I previously submitted a version of this patch set called "memdelay", which translated delays from reclaim, swap-in, and thrashing page cache into a pressure percentage of lost walltime. I've since extended this code to aggregate all delay states tracked by delayacct in order to have generalized pressure/overcommit levels for CPU, memory, and IO.

There was feedback from Peter on the previous version that I have incorporated as much as possible and as far as it still applies to this code:

- got rid of the extra lock in the sched callbacks; all task state changes we care about serialize through rq->lock

- got rid of ktime_get() inside the sched callbacks and switched time measuring to rq_clock()

- got rid of all divisions inside the sched callbacks, tracking everything natively in ns now

I also moved this stuff into the existing sched/stat.h callbacks, so it doesn't get in the way in sched/core.c, and of course moved the whole thing behind CONFIG_PSI since not everyone is going to want it.

Real-world applications

Since the last posting, we've begun using the data collected by this code quite extensively at Facebook, with several success stories.

First we used it on systems that frequently locked up in low memory situations. This happens because the OOM killer is triggered by reclaim not being able to make forward progress, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 80-90% of their time faulting executables. There is no situation where this ever makes sense in practice.
We wrote a <100 line POC python script to monitor memory pressure and kill stuff manually, way before such pathological thrashing sets in. We've since extended the python script into a more generic oomd that we use all over the place, not just to avoid livelocks but also to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in.

We also use the memory pressure info for loadshedding. Our batch job infrastructure used to refuse new requests based on heuristics around RSS and other existing VM metrics, in an attempt to avoid OOM kills and maximize utilization. Since it was still plagued by frequent OOM kills, we switched it to shed load on psi memory pressure, which has turned out to be a much better bellwether, and we managed to reduce OOM kills drastically. Reducing the rate of OOM outages from the worker pool raised its aggregate productivity, and we were able to switch that service to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, and configuration, as well as to prevent multiple workloads on a machine from stepping on each other's toes. We were not able to do this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard or impossible to root-cause them post-mortem. We now log and graph the pressure metrics for all containers in our fleet and can trivially link service drops to resource pressure after the fact.

How do you use this?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply calculate pressure at the cgroup level instead of system-wide.
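To make the interface concrete, here is a minimal sketch of a psi-based memory-pressure monitor in the spirit of the POC script mentioned above. This is an illustrative reconstruction, not the actual script or oomd; the function names, threshold, and polling interval are all made up:

```python
# Illustrative sketch of a psi memory-pressure monitor. Assumes a
# kernel with CONFIG_PSI=y; all names and thresholds are made up.
import time

PRESSURE_FILE = "/proc/pressure/memory"

def parse_pressure(line):
    """Parse one psi line, e.g.
    'some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722'
    into (kind, {field: value}).
    """
    kind, rest = line.split(None, 1)
    fields = dict(kv.split("=") for kv in rest.split())
    return kind, {k: float(v) for k, v in fields.items()}

def read_pressure(path=PRESSURE_FILE):
    """Return {'some': {...}, 'full': {...}} for a pressure file."""
    with open(path) as f:
        return dict(parse_pressure(line) for line in f)

def monitor(threshold=40.0, interval=5):
    """Poll the 'full' 10s average and act when it stays elevated."""
    while True:
        full = read_pressure()["full"]
        if full["avg10"] > threshold:
            print("sustained full memory pressure: %.2f%%" % full["avg10"])
            # ... pick and kill the most expendable task here ...
        time.sleep(interval)
```

On a CONFIG_PSI kernel something like this could run as-is against /proc/pressure/memory or, with a different path, against a cgroup2 memory.pressure file.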
The cpu file contains one line:

  some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which some tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, and 5m windows, so you can tell short-term trends from long-term ones, similar to the load average.

What to make of this number? If CPU utilization is at 100% and CPU pressure is 0, the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state, of course: no CPU cycles would be wasted even if 50% of the threads were to go idle (and most workloads do vary). From the perspective of the individual job it's not great, however, and it might do better with more resources. Depending on what your priority is, an elevated "some" number may or may not require action.

The memory file contains two lines:

  some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
  full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu: the time in which at least one task is stalled on the resource. The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks could be waiting on thrashing cache simultaneously. It can also happen when a single reclaimer occupies the CPU, since nothing else can make forward progress during that time. Either way, CPU cycles are being wasted. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making much headway.

The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages.
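Sampling the total= counter twice, some interval apart, is enough to derive a pressure percentage over any window you like: the difference is the stall time accrued during that interval, and dividing by the interval length gives the percentage. A one-line sketch (the helper name is made up):

```python
# Sketch: derive a custom-window pressure percentage from two samples
# of the total= counter (absolute stall time in microseconds). The
# helper name and the 1-second window in the example are illustrative.
def pressure_between_samples(total_start_us, total_end_us, window_us):
    """Percentage of the window spent stalled."""
    return 100.0 * (total_end_us - total_start_us) / window_us

# e.g. total= grew from 157656722 to 157906722 over a 1s sample
# window: 250ms of stall in 1s of walltime, i.e. 25% pressure.
```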
It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware).

The io file is similar to memory. However, unlike CPU and memory, the block layer doesn't have a concept of hardware contention. We cannot know whether the IO a task is waiting on is being performed by the device, or whether the device is busy with, or slowed down by, other requests. As a result, we can tell how many CPU cycles go to waste due to IO delays, but we cannot identify the competition factor in those delays.

These patches are against v4.17-rc4.

 Documentation/accounting/psi.txt                |  73 ++++
 Documentation/cgroup-v2.txt                     |  18 +
 arch/powerpc/platforms/cell/cpufreq_spudemand.c |   2 +-
 arch/powerpc/platforms/cell/spufs/sched.c       |   9 +-
 arch/s390/appldata/appldata_os.c                |   4 -
 drivers/cpuidle/governors/menu.c                |   4 -
 fs/proc/loadavg.c                               |   3 -
 include/linux/cgroup-defs.h                     |   4 +
 include/linux/cgroup.h                          |  15 +
 include/linux/delayacct.h                       |  23 +
 include/linux/mmzone.h                          |   1 +
 include/linux/page-flags.h                      |   5 +-
 include/linux/psi.h                             |  52 +++
 include/linux/psi_types.h                       |  84 ++++
 include/linux/sched.h                           |  10 +
 include/linux/sched/loadavg.h                   |  90 +++-
 include/linux/sched/stat.h                      |  10 +-
 include/linux/swap.h                            |   2 +-
 include/trace/events/mmflags.h                  |   1 +
 include/uapi/linux/taskstats.h                  |   6 +-
 init/Kconfig                                    |  20 +
 kernel/cgroup/cgroup.c                          |  45 +-
 kernel/debug/kdb/kdb_main.c                     |   7 +-
 kernel/delayacct.c                              |  15 +
 kernel/fork.c                                   |   4 +
 kernel/sched/Makefile                           |   1 +
 kernel/sched/core.c                             |   3 +
 kernel/sched/loadavg.c                          |  84 ----
 kernel/sched/psi.c                              | 499 ++++++++++++++++++++++
 kernel/sched/sched.h                            | 166 +++----
 kernel/sched/stats.h                            |  91 +++-
 mm/compaction.c                                 |   5 +
 mm/filemap.c                                    |  27 +-
 mm/huge_memory.c                                |   1 +
 mm/memcontrol.c                                 |   2 +
 mm/migrate.c                                    |   2 +
 mm/page_alloc.c                                 |  10 +
 mm/swap_state.c                                 |   1 +
 mm/vmscan.c                                     |  14 +
 mm/vmstat.c                                     |   1 +
 mm/workingset.c                                 | 113 +++--
 tools/accounting/getdelays.c                    |   8 +-
 42 files changed, 1279 insertions(+), 256 deletions(-)