Received: by 2002:a25:8b12:0:0:0:0:0 with SMTP id i18csp5180634ybl; Tue, 27 Aug 2019 00:16:48 -0700 (PDT) X-Google-Smtp-Source: APXvYqwPfCQHzdjt8eSTU0MoY+MzvfjfC+TmvSalHJZDw5NN+iUBMEh8eJrvcUVEa7/xbAin8haU X-Received: by 2002:a17:902:8a93:: with SMTP id p19mr23275690plo.106.1566890207927; Tue, 27 Aug 2019 00:16:47 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1566890207; cv=none; d=google.com; s=arc-20160816; b=Xzdi/GcLi50jYwONMUugUsRcDyX6UFUwxVKALsv2PW2HkWUlk4Sid5UhphElM8Lu0z fowF5dLm79bgf1M0BmF3z0XNs7ap7Nvapnld/JKpSo6czfXzDw89AjwEFd0MwmEHnNJf atvcFGIyhjgiZk3yx3Vyg2Lv1AxnPJDES7PXXdsaIjukfISYdRbtzOG1chW0LUKY20l1 6h8EGnL5hJwKPyLQfeDvkeBWRUmDRrMdngS0i4PdRLCP4XHfQyktu9k3GYYxBTzesyVt CaNPX22LajnfiRuRRjEN7G7TFbxi8tHylYMtwH4/sBTyvE/IyKsycMazC3QBwu3PuJCh QhzQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=Hl+bus1eDRVKKOs0WlPgYl+CG3J+hsdBKGxYpkNC2DI=; b=Dx9HBEgHB1xWamkrASYBCfXsoLj7eAX7IT/+KLxuy8X/RavTCSOhQ0JEqrRv/QUotW JS9yl3zGjo+swJ6rnpENwRi0LDtF4OUjASLAFUO4BQE8/Mh06grkU+mWnoIIH5AE3FbU GxiGMNSqEBPjIRGaS2fRbjeJ266hLP1JBTTZR6fcJRIkbfBLYatWQkwckcVtKUd5h2lf aZUUk2XlACR3H4iIUrqhr/kWrQJmKWVczWdSpDvij93mw59BmNcDukZUbqDzmnf6RFgo DbWEy2iwwtcAzpDkT0dymt0rqI1yWRfyhwkpSOi7rqdaseW6I2fxuiV34tswo5TrIfij v3xw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id b19si1773092pjo.55.2019.08.27.00.16.32; Tue, 27 Aug 2019 00:16:47 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728729AbfH0HP2 (ORCPT + 99 others); Tue, 27 Aug 2019 03:15:28 -0400 Received: from mx2.suse.de ([195.135.220.15]:39760 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726278AbfH0HP1 (ORCPT ); Tue, 27 Aug 2019 03:15:27 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 3231AAF37; Tue, 27 Aug 2019 07:15:25 +0000 (UTC) Date: Tue, 27 Aug 2019 09:15:23 +0200 From: Michal Hocko To: Edward Chron Cc: Andrew Morton , Roman Gushchin , Johannes Weiner , David Rientjes , Tetsuo Handa , Shakeel Butt , linux-mm@kvack.org, linux-kernel@vger.kernel.org, colona@arista.com Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information Message-ID: <20190827071523.GR7538@dhcp22.suse.cz> References: <20190826193638.6638-1-echron@arista.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190826193638.6638-1-echron@arista.com> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 26-08-19 12:36:28, Edward Chron wrote: [...] > Extensibility using OOM debug options > ------------------------------------- > What is needed is an extensible system to optionally configure > debug options as needed and to then dynamically enable and disable > them. Also for options that produce multiple lines of entry based > output, to configure which entries to print based on how much > memory they use (or optionally all the entries). With a patch this large and adding a lot of new stuff we need a more detailed usecases described I believe. [...] > Use of debugfs to allow dynamic controls > ---------------------------------------- > By providing a debugfs interface that allows options to be configured, > enabled and where appropriate to set a minimum size for selecting > entries to print, the output produced when an OOM event occurs can be > dynamically adjusted to produce as little or as much detail as needed > for a given system. Who is going to consume this information and why would that consumer be unreasonable to demand further maintenance of that information in future releases? In other words debugfs is not considered a stableAPI which is OK here but the side effect of any change to these files results in user visible behavior and we consider that more or less a stable as long as there are consumers. > OOM debug options can be added to the base code as needed. > > Currently we have the following OOM debug options defined: > > * System State Summary > -------------------- > One line of output that includes: > - Uptime (days, hour, minutes, seconds) We do have timestamps in the log so why is this needed? > - Number CPUs > - Machine Type > - Node name > - Domain name why are these needed? That is a static information that doesn't really influence the OOM situation. > - Kernel Release > - Kernel Version part of the oom report > > Example output when configured and enabled: > > Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019 > > * Tasks Summary > ------------- > One line of output that includes: > - Number of Threads > - Number of processes > - Forks since boot > - Processes that are runnable > - Processes that are in iowait We do have sysrq+t for this kind of information. Why do we need to duplicate it? > Example output when configured and enabled: > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0 > > * ARP Table and/or Neighbour Discovery Table Summary > -------------------------------------------------- > One line of output each for ARP and ND that includes: > - Table name > - Table size (max # entries) > - Key Length > - Entry Size > - Number of Entries > - Last Flush (in seconds) > - hash grows > - entry allocations > - entry destroys > - Number lookups > - Number of lookup hits > - Resolution failures > - Garbage Collection Forced Runs > - Table Full > - Proxy Queue Length > > Example output when configured and enabled (for both): > > ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199 resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0 > > ... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0 Again, why is this needed particularly for the OOM event? I do understand this might be useful system health diagnostic information but how does this contribute to the OOM? > * Add Select Slabs Print > ---------------------- > Allow select slab entries (based on a minimum size) to be printed. > Minimum size is specified as a percentage of the total RAM memory > in tenths of a percent, consistent with existing OOM process scoring. > Valid values are specified from 0 to 1000 where 0 prints all slab > entries (all slabs that have at least one slab object in use) up > to 1000 which would require a slab to use 100% of memory which can't > happen so in that case only summary information is printed. > > The first line of output is the standard Linux output header for > OOM printed Slab entries. This header looks like this: > > Aug 6 09:37:21 egc103 yourserver: Unreclaimable slab info: > > The output is existing slab entry memory usage limited such that only > entries equal to or larger than the minimum size are printed. > Empty slabs (no slab entries in slabs in use) are never printed. > > Additional output consists of summary information that is printed > at the end of the output. This summary information includes: > - # entries examined > - # entries selected and printed > - minimum entry size for selection > - Slabs total size (kB) > - Slabs reclaimable size (kB) > - Slabs unreclaimable size (kB) > > Example Summary output when configured and enabled: > > Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB > > Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB I am all for practical improvements for slab reporting. It is not really trivial to find a good balance though. Printing all the caches simply doesn't scale. So I would start by improving the current state rather than adding more configurability. > > * Add Select Vmalloc allocations Print > ------------------------------------ > Allow select vmalloc entries (based on a minimum size) to be printed. > Minimum size is specified as a percentage of the total RAM memory > in tenths of a percent, consistent with existing OOM process scoring. > Valid values are specified from 0 to 1000 where 0 prints all vmalloc > entries (all vmalloc allocations that have at least one page in use) up > to 1000 which would require a vmalloc to use 100% of memory which can't > happen so in that case only summary information is printed. > > The first line of output is a new Vmalloc output header for > OOM printed Vmalloc entries. This header looks like this: > > Aug 19 19:27:01 yourserver kernel: Vmalloc Info: > > The output is vmalloc entry information output limited such that only > entries equal to or larger than the minimum size are printed. > Unused vmallocs (no pages assigned to the vmalloc) are never printed. > The vmalloc entry information includes: > - Size (in bytes) > - pages (Number pages in use) > - Caller Information to identify the request > > A sample vmalloc entry output looks like this: > > Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113 > > Additional output consists of summary information that is printed > at the end of the output. This summary information includes: > - Number of Vmalloc entries examined > - Number of Vmalloc entries printed > - minimum entry size for selection > > A sample Vmalloc Summary output looks like this: > > Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB This is a lot of information. I wouldn't be surprised if this alone could easily overflow the ringbuffer. Besides that, it is rarely useful for the OOM situation debugging. The overall size of the vmalloc area is certainly interesting but I am not sure we have a handy counter to cope with constrained OOM contexts. > * Add Select Process Entries Print > -------------------------------- > Allow select process entries (based on a minimum size) to be printed. > Minimum size is specified as a percentage totalpages (RAM + swap) > in tenths of a percent, consistent with existing OOM process scoring. > Note: user process memory can be swapped out when swap space present > so that is why swap space and ram memory comprise the totalpages > used to calculate the percentage of memory a process is using. > Valid values are specified from 0 to 1000 where 0 prints all user > processes (that have valid mm sections and aren't exiting) up to > 1000 which would require a user process to use 100% of memory which > can't happen so in that case only summary information is printed. > > The first line of output is the standard Linux output headers for > OOM printed User Processes. This header looks like this: > > Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages): > Aug 19 19:27:01 yourserver kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name > > The output is existing per user process data limited such that only > entries equal to or larger than the minimum size are printed. > > Jul 21 20:07:48 yourserver kernel: [ 579] 0 579 7942 1010 90112 0 -1000 systemd-udevd > > Additional output consists of summary information that is printed > at the end of the output. This summary information includes: > > Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB This sounds like a good idea to limit the eligible process list size but I am concerned that it might get misleading easily when there are many small processes contributing to the OOM in the end. > * Add Enhanced Process Print Information > -------------------------------------- > Add OOM Debug code that prints additional detailed information about > users processes that were considered for OOM killing for any print > selected processes. The information is displayed for each user process > that OOM prints in the output. > > This supplemental per user process information is very helpful for > determing how process memory is used to allow OOM event root cause > identifcation that might not otherwise be possible. > > Output information for enhanced user process entrys printed includes: > - pid > - parent pid > - ruid > - euid > - tgid > - Process State (S) > - utime in seconds > - stime in seconds > - oom_score_adjust > - task comm value (name of process) > - Vmem KiB > - MaxRss KiB > - CurRss KiB > - Pte KiB > - Swap KiB > - Sock KiB > - Lib KiB > - Text KiB > - Heap KiB > - Stack KiB > - File KiB > - Shmem KiB > - Read Pages > - Fault Pages > - Lock KiB > - Pinned KiB I can see some of these being interesting but I would rather pick up those and add to the regular oom output rather than go over configuring them. > Configuring Patches: > ------------------- > OOM Debug and any options you want to use must first be configured so > the code is included in your kernel. This requires selecting kernel > config file options. You will find config options to select under: > > Kernel hacking ---> Memory Debugging ---> > > [*] Debug OOM > [*] Debug OOM System State > [*] Debug OOM System Tasks Summary > [*] Debug OOM ARP Table > [*] Debug OOM ND Table > [*] Debug OOM Select Slabs Print > [*] Debug OOM Slabs Select Always Print Enable > [*] Debug OOM Enhanced Slab Print > [*] Debug OOM Select Vmallocs Print > [*] Debug OOM Select Process Print > [*] Debug OOM Enhanced Process Print I really dislike these though. We already have zillions of debugging options and the config space is enormous. Different combinations of them make any compile testing a challenge and a lot of cpu cycles eaten. Besides that, who is going to configure those in without using them directly? Distributions are not going to enable without having all options being disabled by default for example. > 12 files changed, 1339 insertions(+), 11 deletions(-) This must have a been a lot of work and I really appreciate that. On the other hand it is a lot of code to maintain (note that you are usually introspecting deep internals of subsystems so changes would have to be carefully considered here as well) without a very strong demand. Sure it is a nice to have thing in some cases. I can imagine that some of that information would have helped me when debugging some weird OOM reports but I strongly suspect I would likely not have all necessary pieces enabled because those were not reproducible. Having everything on is just not usable due to amount of data. printk is not free and we have seen cases where a lot of output just turned the machine into unsuable state. If you have a reproducible OOMs then you can trigger a panic and have the full state of the system to examine. So I am not really convinced all this is going to be used to justify the maintenance overhead. All that being said, I do not think this is something we want to merge without a really _strong_ usecase to back it. Thanks! -- Michal Hocko SUSE Labs