2019-08-26 21:48:27

by Edward Chron

Subject: [PATCH 00/10] OOM Debug print selection and additional information

This patch series provides code, enabled as a debug option through
debugfs, that adds controls to limit how much information gets printed
when an OOM event occurs and/or to optionally print additional
information about slab usage, vmalloc allocations, user process memory
usage, the number of processes/tasks and some summary information
about those tasks (number runnable, number in i/o wait), system
information (number of CPUs, kernel version and other useful state of
the system), and ARP and ND cache entry information.

Linux OOM can optionally provide a lot of information; what's missing?
----------------------------------------------------------------------
Linux provides a variety of detailed information when an OOM event occurs
but has limited options to control how much output is produced. The
system-related information is produced unconditionally, and limited per
user process information is produced as a default-enabled option. The
per user process information may be disabled.

Slab usage information was recently added and is output only if slab
usage exceeds user memory usage.

Many OOM events are caused by user application memory usage, sometimes
in combination with kernel resource usage, exceeding expected levels.
Detailed information about how memory was being used when the event
occurred may be required to identify the root cause of the OOM event.

However, some environments are very large and printing all of the
information about processes, slabs and/or vmalloc allocations may
not be feasible. For other environments, printing as much information
about these as possible may be needed to root cause OOM events.

Extensibility using OOM debug options
-------------------------------------
What is needed is an extensible facility to configure debug options
as needed and to then dynamically enable and disable them. For options
that produce multiple lines of per-entry output, it should also be
possible to select which entries are printed based on how much
memory they use (or optionally to print all entries).

Limiting print entry output based on object size
------------------------------------------------
To limit output, a fixed object size could be used: for example,
vmalloc allocations using more than 1MB, slabs using more than
512KB, or processes using 16MB or more of memory. Such an approach
is quite reasonable.

Using OOM's memory metrics to limit printing based on entry size
----------------------------------------------------------------
However, the current OOM implementation, which has been in use for
almost a decade, scores in units of 1/10 of a percent of memory. This
methodology scales well as memory sizes increase. If you limit the
objects you examine to those using at least 0.1% of memory you may
still get a large number of objects, but you avoid printing those
using a relatively small amount of memory.

Options that limit output based on object size also allow the minimum
size to be set to zero, in which case even objects that use only a
small amount of memory will be printed.

Use of debugfs to allow dynamic controls
----------------------------------------
By providing a debugfs interface that allows options to be configured,
enabled and where appropriate to set a minimum size for selecting
entries to print, the output produced when an OOM event occurs can be
dynamically adjusted to produce as little or as much detail as needed
for a given system.

OOM debug options can be added to the base code as needed.

Currently we have the following OOM debug options defined:

* System State Summary
--------------------
One line of output that includes:
- Uptime (days, hour, minutes, seconds)
- Number CPUs
- Machine Type
- Node name
- Domain name
- Kernel Release
- Kernel Version

Example output when configured and enabled:

Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019

* Tasks Summary
-------------
One line of output that includes:
- Number of Threads
- Number of processes
- Forks since boot
- Processes that are runnable
- Processes that are in iowait

Example output when configured and enabled:

Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0

* ARP Table and/or Neighbour Discovery Table Summary
--------------------------------------------------
One line of output each for ARP and ND that includes:
- Table name
- Table size (max # entries)
- Key Length
- Entry Size
- Number of Entries
- Last Flush (in seconds)
- hash grows
- entry allocations
- entry destroys
- Number lookups
- Number of lookup hits
- Resolution failures
- Garbage Collection Forced Runs
- Table Full
- Proxy Queue Length

Example output when configured and enabled (for both):

... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199 resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0

... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0

* Add Select Slabs Print
----------------------
Allow selected slab entries (based on a minimum size) to be printed.
The minimum size is specified as a percentage of total RAM memory
in tenths of a percent, consistent with existing OOM process scoring.
Valid values range from 0 to 1000: 0 prints all slab entries (all
slabs that have at least one slab object in use), while 1000 would
require a slab to use 100% of memory, which can't happen, so in that
case only summary information is printed.

The first line of output is the standard Linux output header for
OOM printed Slab entries. This header looks like this:

Aug 6 09:37:21 yourserver kernel: Unreclaimable slab info:

The output is the existing slab entry memory usage, limited such that
only entries equal to or larger than the minimum size are printed.
Empty slabs (no slab objects in use) are never printed.

Additional output consists of summary information that is printed
at the end of the output. This summary information includes:
- # entries examined
- # entries selected and printed
- minimum entry size for selection
- Slabs total size (kB)
- Slabs reclaimable size (kB)
- Slabs unreclaimable size (kB)

Example Summary output when configured and enabled:

Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB

Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB

* Add Select Vmalloc allocations Print
------------------------------------
Allow selected vmalloc entries (based on a minimum size) to be printed.
The minimum size is specified as a percentage of total RAM memory
in tenths of a percent, consistent with existing OOM process scoring.
Valid values range from 0 to 1000: 0 prints all vmalloc entries (all
vmalloc allocations that have at least one page in use), while 1000
would require a vmalloc allocation to use 100% of memory, which can't
happen, so in that case only summary information is printed.

The first line of output is a new Vmalloc output header for
OOM printed Vmalloc entries. This header looks like this:

Aug 19 19:27:01 yourserver kernel: Vmalloc Info:

The output is vmalloc entry information, limited such that only
entries equal to or larger than the minimum size are printed.
Unused vmallocs (no pages assigned to the vmalloc) are never printed.
The vmalloc entry information includes:
- Size (in bytes)
- pages (Number pages in use)
- Caller Information to identify the request

A sample vmalloc entry output looks like this:

Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113

Additional output consists of summary information that is printed
at the end of the output. This summary information includes:
- Number of Vmalloc entries examined
- Number of Vmalloc entries printed
- minimum entry size for selection

A sample Vmalloc Summary output looks like this:

Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB

* Add Select Process Entries Print
--------------------------------
Allow selected process entries (based on a minimum size) to be printed.
The minimum size is specified as a percentage of totalpages (RAM + swap)
in tenths of a percent, consistent with existing OOM process scoring.
Note: user process memory can be swapped out when swap space is present,
which is why swap space plus RAM memory comprise the totalpages used
to calculate the percentage of memory a process is using.
Valid values range from 0 to 1000: 0 prints all user processes (that
have valid mm sections and aren't exiting), while 1000 would require
a user process to use 100% of memory, which can't happen, so in that
case only summary information is printed.

The first line of output is the standard Linux output headers for
OOM printed User Processes. This header looks like this:

Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages):
Aug 19 19:27:01 yourserver kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name

The output is existing per user process data limited such that only
entries equal to or larger than the minimum size are printed.

Jul 21 20:07:48 yourserver kernel: [ 579] 0 579 7942 1010 90112 0 -1000 systemd-udevd

Additional output consists of summary information that is printed
at the end of the output. This summary information includes:

Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB

* Add Slab Select Always Print Enable
-----------------------------------
This option enables slab entries to be printed even when slab memory
usage does not exceed the standard Linux user memory usage print
trigger. The standard OOM event slab entry print trigger is that slab
memory usage exceeds user memory usage; this covers cases where the
kernel or kernel drivers drive slab memory usage up excessively.
However, OOM events are often caused by user processes using too much
memory, and even in cases where user memory usage is higher, the
amount of slab memory consumed can still be an important factor in
determining what caused the OOM event. In such cases it is useful to
have slab memory usage for any slab entries using a significant
amount of memory.

No changes to the output format occur; enabling the option simply
causes whatever slabs are print-eligible (per the Select Slabs
option, which this option depends on) to be printed on any OOM
event, regardless of whether slab memory usage exceeds user
memory usage.

* Add Enhanced Slab Print Information
-----------------------------------
For any slab entries that are print eligible (from Select Slabs
option, which this option depends on) print some additional
details about the slab that can be useful to root causing
OOM events.

Output information for each enhanced slab entry includes:
- Used space (KiB)
- Total space (KiB)
- Active objects
- Total Objects
- Object size
- Aligned object size
- Object per Slab
- Pages per Slab
- Active Slabs
- Total Slabs
- Slab name

The header for enhanced slab entries is revised and looks like this:

Aug 19 19:27:01 coronado kernel: UsedKiB TotalKiB ActiveObj TotalObj ObjSize AlignSize Objs/Slab Pgs/Slab ActiveSlab TotalSlab Slab_Name

Each enhanced slab entry is similar to the following output format:

Aug 19 19:27:01 coronado kernel: 9016 9016 384710 384710 24 24 170 1 2263 2263 avtab_node


* Add Enhanced Process Print Information
--------------------------------------
Add OOM Debug code that prints additional detailed information about
user processes that were considered for OOM killing, for any
print-selected processes. The information is displayed for each user
process that OOM prints in the output.

This supplemental per user process information is very helpful for
determining how process memory is used, allowing OOM event root cause
identification that might not otherwise be possible.

Output information for each enhanced user process entry printed includes:
- pid
- parent pid
- ruid
- euid
- tgid
- Process State (S)
- utime in seconds
- stime in seconds
- oom_score_adjust
- task comm value (name of process)
- Vmem KiB
- MaxRss KiB
- CurRss KiB
- Pte KiB
- Swap KiB
- Sock KiB
- Lib KiB
- Text KiB
- Heap KiB
- Stack KiB
- File KiB
- Shmem KiB
- Read Pages
- Fault Pages
- Lock KiB
- Pinned KiB

The headers for Processes change to match the data being printed:

Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in KiB):

...: [ pid ] ppid ruid euid tgid S utimeSec stimeSec VmemKiB MaxRssKiB CurRssKiB PteKiB SwapKiB SockKiB LibKiB TextKiB HeapKiB StackKiB FileKiB ShmemKiB ReadPgs FaultPgs LockKiB PinnedKiB Adjust name

A few entries that print formatted to match the second header:

...: [ 570] 1 0 0 570 S 0.530 0.105 31632 12064 3864 88 0 416 9500 208 3608 132 36 0 60 41615 0 0 -1000 systemd-udevd
...: [ 759] 1 0 0 759 S 1.264 0.545 17196 6072 788 72 0 624 8912 32 596 132 0 0 0 0 0 0 0 rngd
...: [ 1626] 1553 10383 10383 1626 S 9.417 2.355 3347904 336316 231672 924 0 416 56452 16 170656 276 2116 150756 4 2309 0 0 0 gnome-shell

Configuring Patches:
-------------------
OOM Debug and any options you want to use must first be configured so
the code is included in your kernel. This requires selecting kernel
config file options. You will find config options to select under:

Kernel hacking ---> Memory Debugging --->

[*] Debug OOM
[*] Debug OOM System State
[*] Debug OOM System Tasks Summary
[*] Debug OOM ARP Table
[*] Debug OOM ND Table
[*] Debug OOM Select Slabs Print
[*] Debug OOM Slabs Select Always Print Enable
[*] Debug OOM Enhanced Slab Print
[*] Debug OOM Select Vmallocs Print
[*] Debug OOM Select Process Print
[*] Debug OOM Enhanced Process Print

The hierarchy shown also displays the dependencies between these OOM
Debug options. Everything depends on Debug OOM, as that is where the
base code that all options require is located. Process has an Enhanced
output but requires Select Process to be enabled, so you can limit the
output since you're asking for more details. The same is true for
Slabs: the Enhanced output requires Select Slabs, and so does Slabs
Select Always Print, to ensure you can limit your output if needed.

Dynamic enable/disable and setting entry minsize for Options
------------------------------------------------------------
As mentioned, all options can be dynamically disabled and re-enabled.
The Select options also allow setting a minimum entry size to limit
entry printing based on the amount of memory entries use, using the
OOM 0% to 100% range in 1/10% increments (0-1000). This is implemented
in debugfs. Entries for OOM Debug are defined in the
/sys/kernel/debug/oom directory.

Arbitrary default values have been selected. The default is to enable
configured options and to set the minimum entry size to 10, which is
1% of memory (or memory plus swap for processes). The choice was to
make sure that by default you don't get a lot of data just for
enabling an option. Here is what the current defaults are set to for
all the OOM Debug options currently defined:

[root@yourserver ~]# grep "" /sys/kernel/debug/oom/*
/sys/kernel/debug/oom/arp_table_summary_enabled:Y
/sys/kernel/debug/oom/nd_table_summary_enabled:Y
/sys/kernel/debug/oom/process_enhanced_print_enabled:Y
/sys/kernel/debug/oom/process_select_print_enabled:Y
/sys/kernel/debug/oom/process_select_print_tenthpercent:10
/sys/kernel/debug/oom/slab_enhanced_print_enabled:Y
/sys/kernel/debug/oom/slab_select_always_print_enabled:Y
/sys/kernel/debug/oom/slab_select_print_enabled:Y
/sys/kernel/debug/oom/slab_select_print_tenthpercent:10
/sys/kernel/debug/oom/system_state_summary_enabled:Y
/sys/kernel/debug/oom/tasks_summary_enabled:Y
/sys/kernel/debug/oom/vmalloc_select_print_enabled:Y
/sys/kernel/debug/oom/vmalloc_select_print_tenthpercent:10

You can disable or re-enable options in the appropriate enable file
or adjust the minimum size value in the appropriate tenthpercent file
as needed.

---------------------------------------------------------------------

Edward Chron (10):
mm/oom_debug: Add Debug base code
mm/oom_debug: Add System State Summary
mm/oom_debug: Add Tasks Summary
mm/oom_debug: Add ARP and ND Table Summary usage
mm/oom_debug: Add Select Slabs Print
mm/oom_debug: Add Select Vmalloc Entries Print
mm/oom_debug: Add Select Process Entries Print
mm/oom_debug: Add Slab Select Always Print Enable
mm/oom_debug: Add Enhanced Slab Print Information
mm/oom_debug: Add Enhanced Process Print Information

include/linux/oom.h | 1 +
include/linux/vmalloc.h | 12 +
include/net/neighbour.h | 12 +
mm/Kconfig.debug | 228 +++++++++++++
mm/Makefile | 1 +
mm/oom_kill.c | 83 ++++-
mm/oom_kill_debug.c | 736 ++++++++++++++++++++++++++++++++++++++++
mm/oom_kill_debug.h | 58 ++++
mm/slab.h | 4 +
mm/slab_common.c | 94 +++++
mm/vmalloc.c | 43 +++
net/core/neighbour.c | 78 +++++
12 files changed, 1339 insertions(+), 11 deletions(-)
create mode 100644 mm/oom_kill_debug.c
create mode 100644 mm/oom_kill_debug.h

--
2.20.1


2019-08-26 21:48:44

by Edward Chron

Subject: [PATCH 09/10] mm/oom_debug: Add Enhanced Slab Print Information

Add OOM Debug code that prints additional detailed information about
each slab entry that has been selected for printing. The extra
information is helpful for root cause identification and problem
analysis.

Configuring Enhanced Slab Print Information
-------------------------------------------
The kernel configuration option that defines this option is
DEBUG_OOM_ENHANCED_SLAB_PRINT. This additional code depends on the
OOM Debug option DEBUG_OOM_SLAB_SELECT_PRINT, which adds code to allow
slab entries to be selectively printed, printing only slab entries
that use a specified minimum amount of memory.

The kernel configuration entry for this option can be found in the
config file at: Kernel hacking, Memory Debugging, Debug OOM,
Debug OOM Select Slabs Print, Debug OOM Enhanced Slab Print.
Both the Debug OOM Select Slabs Print and Debug OOM Enhanced Slab
Print entries must be selected.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found
at: /sys/kernel/debug/oom/slab_enhanced_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.
When configured the default setting is set to enabled.

Content and format of slab entry messages
-----------------------------------------
In addition to the Used and Total space (in KiB) fields that are
displayed by the standard Linux OOM slab reporting code, the enhanced
entries include: active objects, total objects, object and aligned
object sizes (both in bytes), objects per slab, pages per slab, active
slabs, total slabs and the slab name (located at the end, easier to read).

Sample Output
-------------
Sample oom report message header and output slab entry message:

Aug 13 18:52:47 mysrvr kernel: UsedKiB TotalKiB ActiveObj TotalObj
ObjSize AlignSize Objs/Slab Pgs/Slab ActiveSlab TotalSlab Slab_Name

Aug 13 18:52:47 mysrvr kernel: 403 412 1613 1648
224 256 16 1 103 103 skbuff_head..

Signed-off-by: Edward Chron <[email protected]>
---
mm/Kconfig.debug | 15 +++++++++++++++
mm/oom_kill_debug.c | 15 +++++++++++++++
mm/oom_kill_debug.h | 3 +++
mm/slab_common.c | 29 +++++++++++++++++++++++++++++
4 files changed, 62 insertions(+)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 68873e26afe1..4414e46f72c6 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -244,6 +244,21 @@ config DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT

If unsure, say N.

+config DEBUG_OOM_ENHANCED_SLAB_PRINT
+ bool "Debug OOM Enhanced Slab Print"
+ depends on DEBUG_OOM_SLAB_SELECT_PRINT
+ help
+ Each OOM slab entry printed includes additional information
+ about its memory usage. Memory usage is specified in KiB and
+ includes the fields shown in the enhanced slab entry header.
+
+ If the option is configured it is enabled/disabled by setting
+ the value of the file entry in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/slab_enhanced_print_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
+
config DEBUG_OOM_VMALLOC_SELECT_PRINT
bool "Debug OOM Select Vmallocs Print"
depends on DEBUG_OOM
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 13f1d1c25a67..ad937b3d59f3 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -244,6 +244,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "slab_select_always_print_",
.support_tpercent = false,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+ {
+ .option_name = "slab_enhanced_print_",
+ .support_tpercent = false,
+ },
#endif
{}
};
@@ -273,6 +279,9 @@ enum oom_debug_options_index {
#endif
#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
SLAB_ALWAYS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+ ENHANCED_SLAB_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -350,6 +359,12 @@ bool oom_kill_debug_select_slabs_always_print_enabled(void)
return oom_kill_debug_enabled(SLAB_ALWAYS_STATE);
}
#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+bool oom_kill_debug_enhanced_slab_print_information_enabled(void)
+{
+ return oom_kill_debug_enabled(ENHANCED_SLAB_STATE);
+}
+#endif

#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
/*
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index bce740573063..a39bc275980e 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -18,6 +18,9 @@ extern bool oom_kill_debug_unreclaimable_slabs_print(void);
#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_ALWAYS_PRINT
extern bool oom_kill_debug_select_slabs_always_print_enabled(void);
#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+extern bool oom_kill_debug_enhanced_slab_print_information_enabled(void);
+#endif

extern u32 oom_kill_debug_oom_event_is(void);
extern u32 oom_kill_debug_event(void);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9ddc95040b60..c6e17e5c6c9d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -28,6 +28,10 @@

#include "slab.h"

+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+#include "oom_kill_debug.h"
+#endif
+
enum slab_state slab_state;
LIST_HEAD(slab_caches);
DEFINE_MUTEX(slab_mutex);
@@ -1450,15 +1454,40 @@ void dump_unreclaimable_slab(void)
mutex_unlock(&slab_mutex);
}

+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+static void oom_debug_slab_enhanced_print(struct slabinfo *psi,
+ struct kmem_cache *pkc)
+{
+ pr_info("%10lu %10lu %10lu %10lu %9u %9u %9u %8u %10lu %10lu %s\n",
+ (psi->active_objs * pkc->size) / 1024,
+ (psi->num_objs * pkc->size) / 1024, psi->active_objs,
+ psi->num_objs, pkc->object_size, pkc->size,
+ psi->objects_per_slab, (1 << psi->cache_order),
+ psi->active_slabs, psi->num_slabs, cache_name(pkc));
+}
+#endif
+
#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
static void oom_debug_slab_header_print(void)
{
pr_info("Unreclaimable slab info:\n");
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+ if (oom_kill_debug_enhanced_slab_print_information_enabled()) {
+ pr_info(" UsedKiB TotalKiB ActiveObj TotalObj ObjSize AlignSize Objs/Slab Pgs/Slab ActiveSlab TotalSlab Slab_Name");
+ return;
+ }
+#endif
pr_info("Name Used Total\n");
}

static void oom_debug_slab_print(struct slabinfo *psi, struct kmem_cache *pkc)
{
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
+ if (oom_kill_debug_enhanced_slab_print_information_enabled()) {
+ oom_debug_slab_enhanced_print(psi, pkc);
+ return;
+ }
+#endif
pr_info("%-17s %10luKB %10luKB\n", cache_name(pkc),
(psi->active_objs * pkc->size) / 1024,
(psi->num_objs * pkc->size) / 1024);
--
2.20.1

2019-08-26 21:48:50

by Edward Chron

Subject: [PATCH 02/10] mm/oom_debug: Add System State Summary

When selected, prints the number of CPUs online at the time of the OOM
event. Also prints nodename, domainname, machine type, kernel release
and version, system uptime, total memory and swap size. Produces a
single line of output holding this information.

This information is useful to help determine the state the system was
in when the event was triggered which is helpful for debugging,
performance measurements and security issues.

Configuring this Debug Option (DEBUG_OOM_SYSTEM_STATE)
------------------------------------------------------
To enable the option it needs to be configured in the OOM Debugging
configure menu. The kernel configuration entry can be found in the
config at: Kernel hacking, Memory Debugging, OOM Debugging the
DEBUG_OOM_SYSTEM_STATE config entry.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The debugfs name prefix for this option is: system_state_summary_
and the file for this option is the enabled file.

This option may be disabled or re-enabled using the debugfs enable file
for this OOM debug option. The debugfs file to enable this entry is found
at: /sys/kernel/debug/oom/system_state_summary_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled and a value of 0 is disabled.
When configured the default setting is set to enabled.

Content and format of System State Summary Output
-------------------------------------------------
One line of output that includes:
- Uptime (days, hour, minutes, seconds)
- Number CPUs
- Machine Type
- Node name
- Domain name
- Kernel Release
- Kernel Version

Sample Output:
-------------
Sample System State Summary message:

Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27
CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain
Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019


Signed-off-by: Edward Chron <[email protected]>
---
mm/Kconfig.debug | 15 +++++++++
mm/oom_kill_debug.c | 81 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 5610da5fa614..dbe599b67a3b 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -132,3 +132,18 @@ config DEBUG_OOM
option is a prerequisite for selecting any OOM debugging options.

If unsure, say N
+
+config DEBUG_OOM_SYSTEM_STATE
+ bool "Debug OOM System State"
+ depends on DEBUG_OOM
+ help
+ When enabled, provides one line of output on an oom event to
+ document the state of the system when the oom event occurred.
+ Prints: uptime, # threads, # processes, system memory size in KiB
+ and swap space size in KiB, nodename, domainname, machine type,
+ kernel release and version. If configured it is enabled/disabled
+ by setting the enabled file entry in the debugfs OOM interface
+ at: /sys/kernel/debug/oom/system_state_summary_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index af07e662c808..6eeaad86fca8 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -144,6 +144,14 @@
#include <linux/sysfs.h>
#include "oom_kill_debug.h"

+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+#include <linux/cpumask.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/utsname.h>
+#include <linux/sched/stat.h>
+#endif
+
#define OOMD_MAX_FNAME 48
#define OOMD_MAX_OPTNAME 32

@@ -169,11 +177,20 @@ struct oom_debug_option {

/* Table of oom debug options, new options need to be added here */
static struct oom_debug_option oom_debug_options_table[] = {
+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+ {
+ .option_name = "system_state_summary_",
+ .support_tpercent = false,
+ },
+#endif
{}
};

/* Option index by name for order one-lookup, add new options entry here */
enum oom_debug_options_index {
+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+ SYSTEM_STATE,
+#endif
OUT_OF_BOUNDS
};

@@ -244,10 +261,74 @@ u32 oom_kill_debug_oom_event(void)
return oom_kill_debug_oom_events;
}

+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+/*
+ * oom_kill_debug_system_summary_prt - provides one line of output to document
+ * some of the system state at the time of an oom event.
+ * Output line includes: uptime, # threads, # processes,
+ * system memory size in KiB and swap space size in KiB,
+ * nodename, domainname, machine type, kernel release
+ * and version.
+ */
+static void oom_kill_debug_system_summary_prt(void)
+{
+ struct new_utsname *p_uts;
+ char domainname[256];
+ unsigned long upsecs;
+ unsigned short hours;
+ struct timespec64 tp;
+ unsigned short days;
+ unsigned short mins;
+ unsigned short secs;
+ char nodename[256];
+ size_t nodesize;
+ char *p_wend;
+ long uptime;
+ int procs;
+
+ p_uts = utsname();
+
+ memset(nodename, 0, sizeof(nodename));
+ memset(domainname, 0, sizeof(domainname));
+
+ p_wend = strchr(p_uts->nodename, '.');
+ if (p_wend != NULL) {
+ nodesize = p_wend - p_uts->nodename;
+ ++p_wend;
+ strncpy(nodename, p_uts->nodename, nodesize);
+ strcpy(domainname, p_wend);
+ } else {
+ strcpy(nodename, p_uts->nodename);
+ strcpy(domainname, "(none)");
+ }
+
+ procs = nr_processes();
+
+ ktime_get_boottime_ts64(&tp);
+ uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0);
+
+ days = uptime / 86400;
+ upsecs = uptime - (days * 86400);
+ hours = upsecs / 3600;
+ upsecs = upsecs - (hours * 3600);
+ mins = upsecs / 60;
+ secs = upsecs - (mins * 60);
+
+ pr_info("System Uptime:%hu days %02hu:%02hu:%02hu CPUs:%u Machine:%s Node:%s Domain:%s Kernel Release:%s Version:%s\n",
+ days, hours, mins, secs, num_online_cpus(), p_uts->machine,
+ nodename, domainname, p_uts->release, p_uts->version);
+}
+#endif /* CONFIG_DEBUG_OOM_SYSTEM_STATE */
+
u32 oom_kill_debug_oom_event_is(void)
{
++oom_kill_debug_oom_events;

+#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
+ if (oom_kill_debug_enabled(SYSTEM_STATE))
+ oom_kill_debug_system_summary_prt();
+#endif
+
return oom_kill_debug_oom_events;
}

--
2.20.1

2019-08-26 21:48:54

by Edward Chron

Subject: [PATCH 04/10] mm/oom_debug: Add ARP and ND Table Summary usage

Adds config options and code to support printing ARP Table and/or
Neighbour Discovery Table usage when an OOM event occurs. This
summarized information provides the memory usage for each table when
configured.

Configuring these two OOM Debug Options
---------------------------------------
Two OOM debug options control this output: CONFIG_DEBUG_OOM_ARP_TBL
for the ARP Table and CONFIG_DEBUG_OOM_ND_TBL for the ND Table. To get
output for both tables, both must be configured. Both kernel config
options are found under the entries: Kernel hacking, Memory Debugging,
Debug OOM, as the DEBUG_OOM_ARP_TBL and DEBUG_OOM_ND_TBL entries
respectively.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The debugfs name prefixes for these options are: arp_table_summary_
and nd_table_summary_, and there is one enabled file for each.

Either option may be disabled or re-enabled using the debugfs entry for
the OOM debug option. The debugfs file to enable the ARP Table option
is found at: /sys/kernel/debug/oom/arp_table_summary_enabled
Similarly, the debugfs file to enable the ND Table option is found at:
/sys/kernel/debug/oom/nd_table_summary_enabled
For either option, the enabled file's value determines whether the
facility is enabled or disabled. A value of 1 is enabled (default) and
a value of 0 is disabled. When configured, the default setting is
enabled. Each option produces one line of output.
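
Each enabled file accepts "1" or "0". As a convenience, here is a
minimal userspace sketch of toggling one of these files (the helper
name is ours; the debugfs path is as documented above and requires
debugfs to be mounted and root privileges):

```c
#include <stdio.h>

/* Write 1 or 0 to an OOM debug "enabled" file, e.g.
 * /sys/kernel/debug/oom/arp_table_summary_enabled.
 * Returns 0 on success, -1 on failure. */
int oom_debug_set_enabled(const char *path, int enable)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", enable ? 1 : 0);
	if (fclose(f) != 0)
		return -1;
	return 0;
}
```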

Content and format of ARP and Neighbour Discovery Tables Summary Output
-----------------------------------------------------------------------
One line of output each for ARP and ND that includes:
- Table name
- Table size (max # entries)
- Key Length
- Entry Size
- Number of Entries
- Last Flush (in seconds)
- hash grows
- entry allocations
- entry destroys
- Number lookups
- Number of lookup hits
- Resolution failures
- Garbage Collection Forced Runs
- Table Full
- Proxy Queue Length
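
The Entry Size above is taken from the table's entry_size field when
set; otherwise the patch falls back to
ALIGN(offsetof(struct neighbour, primary_key) + key_len,
NEIGH_PRIV_ALIGN). A userspace sketch of that round-up (the offset and
alignment arguments below are illustrative, not the kernel's actual
values):

```c
#include <stddef.h>

/* Kernel-style ALIGN(): round x up to the next multiple of a,
 * where a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Fallback entry size: key offset within the entry plus key length,
 * rounded up to the entry alignment. */
size_t neigh_entry_size(size_t key_offset, size_t key_len, size_t align)
{
	return ALIGN_UP(key_offset + key_len, align);
}
```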

Sample Output:
-------------
Here is sample output for both the ARP table and ND table:

Jul 23 23:26:34 yoursystem kernel: neighbour: Table: arp_tbl size: 256
keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s
hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199
resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0

Jul 23 23:26:34 yoursystem kernel: neighbour: Table: nd_tbl size: 128
keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s
hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0
resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0


Signed-off-by: Edward Chron <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: [email protected]
---
include/net/neighbour.h | 12 +++++++
mm/Kconfig.debug | 26 ++++++++++++++
mm/oom_kill_debug.c | 38 ++++++++++++++++++++
net/core/neighbour.c | 78 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 154 insertions(+)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 50a67bd6a434..35fdecff2724 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -569,4 +569,16 @@ static inline void neigh_update_is_router(struct neighbour *neigh, u32 flags,
*notify = 1;
}
}
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+/*
+ * Routine used to print ARP table and neighbour table statistics.
+ * Output goes to dmesg along with all the other OOM related messages
+ * when the config options DEBUG_OOM_ARP_TBL and DEBUG_OOM_ND_TBL are
+ * set to yes, for the ARP and Neighbour Discovery tables respectively.
+ */
+extern void neightbl_print_stats(const char * const tblname,
+ struct neigh_table * const neightable);
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
+
#endif
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fcbc5f9aa146..fe4bb5ce0a6d 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -163,3 +163,29 @@ config DEBUG_OOM_TASKS_SUMMARY
A value of 1 is enabled (default) and a value of 0 is disabled.

If unsure, say N.
+
+config DEBUG_OOM_ARP_TBL
+ bool "Debug OOM ARP Table"
+ depends on DEBUG_OOM
+ help
+ When enabled, documents kernel memory usage by the ARP Table
+ entries at the time of an OOM event. Output is one line of
+ summarized ARP Table usage. If configured it is enabled/disabled
+ by setting the enabled file entry in the debugfs OOM interface
+ at: /sys/kernel/debug/oom/arp_table_summary_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
+
+config DEBUG_OOM_ND_TBL
+ bool "Debug OOM ND Table"
+ depends on DEBUG_OOM
+ help
+ When enabled, documents kernel memory usage by the ND Table
+ entries at the time of an OOM event. Output is one line of
+ summarized ND Table usage. If configured it is enabled/disabled
+ by setting the enabled file entry in the debugfs OOM interface
+ at: /sys/kernel/debug/oom/nd_table_summary_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 395b3307f822..c4a9117633fd 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -156,6 +156,16 @@
#include <linux/sched/stat.h>
#endif

+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+#include <net/arp.h>
+#endif
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/ndisc.h>
+#endif
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/neighbour.h>
+#endif
+
#define OOMD_MAX_FNAME 48
#define OOMD_MAX_OPTNAME 32

@@ -192,6 +202,18 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "tasks_summary_",
.support_tpercent = false,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+ {
+ .option_name = "arp_table_summary_",
+ .support_tpercent = false,
+ },
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+ {
+ .option_name = "nd_table_summary_",
+ .support_tpercent = false,
+ },
#endif
{}
};
@@ -203,6 +225,12 @@ enum oom_debug_options_index {
#endif
#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
TASKS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+ ARP_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+ ND_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -351,6 +379,16 @@ u32 oom_kill_debug_oom_event_is(void)
oom_kill_debug_system_summary_prt();
#endif

+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+ if (oom_kill_debug_enabled(ARP_STATE))
+ neightbl_print_stats("arp_tbl", &arp_tbl);
+#endif
+
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+ if (oom_kill_debug_enabled(ND_STATE))
+ neightbl_print_stats("nd_tbl", &nd_tbl);
+#endif
+
#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
if (oom_kill_debug_enabled(TASKS_STATE))
oom_kill_debug_tasks_summary_print();
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f79e61c570ea..9f5a579542a9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3735,3 +3735,81 @@ static int __init neigh_init(void)
}

subsys_initcall(neigh_init);
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+void neightbl_print_stats(const char * const tblname,
+ struct neigh_table * const tbl)
+{
+ struct neigh_hash_table *nht;
+ struct ndt_stats ndst;
+ unsigned long now;
+ unsigned long flush_delta;
+ u32 tblsize;
+ u16 key_len;
+ u16 entry_size;
+ u32 entries;
+ u32 last_flush; /* delta to now in secs */
+ u32 hash_shift;
+ u32 proxy_qlen;
+ int cpu;
+
+ read_lock_bh(&tbl->lock);
+ now = jiffies;
+ flush_delta = now - tbl->last_flush;
+
+ key_len = tbl->key_len;
+ if (tbl->entry_size)
+ entry_size = tbl->entry_size;
+ else
+ entry_size = ALIGN(offsetof(struct neighbour, primary_key) +
+ key_len, NEIGH_PRIV_ALIGN);
+
+ entries = atomic_read(&tbl->entries);
+ if (entries == 0)
+ goto out_tbl_unlock;
+
+ /* last flush was last_flush seconds ago */
+ last_flush = jiffies_to_msecs(flush_delta) / 1000;
+ proxy_qlen = tbl->proxy_queue.qlen;
+
+ rcu_read_lock_bh();
+ nht = rcu_dereference_bh(tbl->nht);
+ if (nht)
+ hash_shift = nht->hash_shift + 1;
+ rcu_read_unlock_bh();
+ if (!nht)
+ goto out_tbl_unlock;
+
+ memset(&ndst, 0, sizeof(ndst));
+ for_each_possible_cpu(cpu) {
+ struct neigh_statistics *st;
+
+ st = per_cpu_ptr(tbl->stats, cpu);
+ ndst.ndts_allocs += st->allocs;
+ ndst.ndts_destroys += st->destroys;
+ ndst.ndts_hash_grows += st->hash_grows;
+ ndst.ndts_res_failed += st->res_failed;
+ ndst.ndts_lookups += st->lookups;
+ ndst.ndts_hits += st->hits;
+ ndst.ndts_periodic_gc_runs += st->periodic_gc_runs;
+ ndst.ndts_forced_gc_runs += st->forced_gc_runs;
+ ndst.ndts_table_fulls += st->table_fulls;
+ }
+
+ read_unlock_bh(&tbl->lock);
+ tblsize = (1 << hash_shift) * sizeof(struct neighbour *);
+ if (tblsize > PAGE_SIZE)
+ tblsize = get_order(tblsize);
+
+ pr_info("Table:%7s size:%5u keyLen:%2hu entrySize:%3hu entries:%5u lastFlush:%5us hGrows:%5llu allocs:%5llu destroys:%5llu lookups:%5llu hits:%5llu resFailed:%5llu gcRuns/Forced:%3llu / %2llu tblFull:%2llu proxyQlen:%2u\n",
+ tblname, tblsize, key_len, entry_size, entries, last_flush,
+ ndst.ndts_hash_grows, ndst.ndts_allocs, ndst.ndts_destroys,
+ ndst.ndts_lookups, ndst.ndts_hits, ndst.ndts_res_failed,
+ ndst.ndts_periodic_gc_runs, ndst.ndts_forced_gc_runs,
+ ndst.ndts_table_fulls, proxy_qlen);
+ return;
+
+out_tbl_unlock:
+ read_unlock_bh(&tbl->lock);
+}
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
--
2.20.1

2019-08-26 21:49:01

by Edward Chron

[permalink] [raw]
Subject: [PATCH 05/10] mm/oom_debug: Add Select Slabs Print

Add OOM Debug code to allow select slab entries to be printed at the
time of an OOM event. Linux already prints slab entries on an OOM
event if the amount of memory used by slabs exceeds the amount of
memory used by user processes. This OOM Debug option prints only slab
entries of a specified minimum size, limiting the amount of output an
OOM event generates for slab entries.

Configuring this OOM Debug Option (DEBUG_OOM_SLAB_SELECT_PRINT)
---------------------------------------------------------------
To configure this OOM debug option, select it in the OOM Debugging
configuration menu. The kernel configuration entry can be found in the
config at: Kernel hacking, Memory Debugging, OOM Debugging, as the
DEBUG_OOM_SLAB_SELECT_PRINT config entry.

Two dynamic OOM debug settings for this option: enable, tenthpercent
--------------------------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs entry prefix for this option is: slab_select_print_
and for select options there are two debugfs files: the enabled file
and the tenthpercent file.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found
at: /sys/kernel/debug/oom/slab_select_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled. The
default if configured is enabled.

Specifying the minimum entry size (0-1000) in the tenthpercent file
-------------------------------------------------------------------
Also for DEBUG_OOM_SLAB_SELECT_PRINT the number of slab entries printed is
adjustable. By default if the DEBUG_OOM_SLAB_SELECT_PRINT config option
is enabled entries that use 1% or more of memory are printed. This can be
adjusted to be entries as small as 0% of memory or as large as 100% of
memory in which case only a summary line is printed, as no slab entry
could possibly use 100% of memory. Adjustments are made in the debugfs
file found at: /sys/kernel/debug/oom/slab_select_print_tenthpercent
Valid entry values are 0 through 1000, which represent memory usage
of 0% of memory to 100% of memory. A value of 0 prints all slabs that
have at least one object in use; unused slabs are not printed.
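
The tenthpercent value maps to a minimum entry size the same way the
patch computes it: minkb = K(totalram_pages()) * tenthpercent / 1000,
where K() converts pages to kB. A userspace sketch of that
calculation, assuming 4 KiB pages (PAGE_SHIFT = 12):

```c
#define PAGE_SHIFT 12 /* assumed 4 KiB pages */
#define K(pages) ((pages) << (PAGE_SHIFT - 10)) /* pages -> kB */

/* Minimum slab entry size in kB for a given tenthpercent (0-1000). */
unsigned long select_min_kb(unsigned long totalram_pages,
			    unsigned int tenthpercent)
{
	return (K(totalram_pages) * tenthpercent) / 1000;
}
```

For example, on an 8 GiB machine (2097152 pages) a tenthpercent of 10
(1% of memory) selects entries of roughly 83886 kB and larger.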

Content of Slab Summary Records Output
--------------------------------------
Additional output consists of summary information that is printed
at the end of the output. This summary information includes:
- # entries examined
- # entries selected and printed
- minimum entry size for selection
- Slabs total size (kB)
- Slabs reclaimable size (kB)
- Slabs unreclaimable size (kB)

Sample Output
-------------
Output produced consists of the standard output currently produced
by Linux for slab entries plus two lines of summary information.
(The standard output provides a section header and one entry per slab.)

Summary output (minsize = 0kB, all entries with > 0 slabs in use printed):

Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123
printed: 83 minsize: 0kB

Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB
Unreclaim: 100580kB
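
The Slabs Total value in the summary is printed as the sum of the
reclaimable and unreclaimable page-state counters, so the three
figures above are always consistent (50632kB + 100580kB = 151212kB). A
trivial sketch of that relationship (the helper name is ours):

```c
/* Slabs Total is printed as Reclaim + Unreclaim (both in kB). */
unsigned long slabs_total_kb(unsigned long reclaim_kb,
			     unsigned long unreclaim_kb)
{
	return reclaim_kb + unreclaim_kb;
}
```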


Signed-off-by: Edward Chron <[email protected]>
---
mm/Kconfig.debug | 30 +++++++++++++++++++++
mm/oom_kill.c | 11 +++++++-
mm/oom_kill_debug.c | 42 +++++++++++++++++++++++++++++
mm/oom_kill_debug.h | 4 +++
mm/slab.h | 4 +++
mm/slab_common.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fe4bb5ce0a6d..c7d53ca95d32 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -189,3 +189,33 @@ config DEBUG_OOM_ND_TBL
A value of 1 is enabled (default) and a value of 0 is disabled.

If unsure, say N.
+
+config DEBUG_OOM_SLAB_SELECT_PRINT
+ bool "Debug OOM Select Slabs Print"
+ depends on DEBUG_OOM
+ help
+ When enabled, limits which unreclaimable slab entries are
+ printed based on the amount of memory each slab entry is
+ consuming. By default only slab entries using 1% or more of
+ memory are printed when the trigger condition to dump slab
+ entries is met.
+
+ If the option is configured it is enabled/disabled by setting
+ the value of the file entry in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/slab_select_print_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ When enabled, the entries printed are limited by the amount of
+ memory they consume. The setting value defines the minimum
+ memory consumed and is expressed in tenths of a percent.
+ Values supported are 0 to 1000 where 0 allows all entries to be
+ printed, 1 would allow entries using 0.1% or more to be printed,
+ 10 would allow entries using 1% or more of memory to be printed.
+
+ If configured and enabled the rate limiting memory percentage
+ is specified by setting a value in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/slab_select_print_tenthpercent
+ If configured the default settings are set to enabled and
+ print limit value of 10 or 1% of memory.
+
+ If unsure, say N.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c10d61fe944f..9022297fa2ba 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -438,6 +438,15 @@ static void dump_tasks(struct oom_control *oc)
}
}

+static void oom_kill_unreclaimable_slabs_print(void)
+{
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+ if (oom_kill_debug_unreclaimable_slabs_print())
+ return;
+#endif
+ dump_unreclaimable_slab();
+}
+
static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
{
/* one line summary of the oom killer context. */
@@ -464,7 +473,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
else {
show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
if (is_dump_unreclaim_slabs())
- dump_unreclaimable_slab();
+ oom_kill_unreclaimable_slabs_print();
}
#ifdef CONFIG_DEBUG_OOM
oom_kill_debug_oom_event_is();
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index c4a9117633fd..2b5245e1134d 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -165,6 +165,9 @@
#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
#include <net/neighbour.h>
#endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+#include "slab.h"
+#endif

#define OOMD_MAX_FNAME 48
#define OOMD_MAX_OPTNAME 32
@@ -214,6 +217,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "nd_table_summary_",
.support_tpercent = false,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+ {
+ .option_name = "slab_select_print_",
+ .support_tpercent = true,
+ },
#endif
{}
};
@@ -231,6 +240,9 @@ enum oom_debug_options_index {
#endif
#ifdef CONFIG_DEBUG_OOM_ND_TBL
ND_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+ SELECT_SLABS_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -361,6 +373,36 @@ static void oom_kill_debug_system_summary_prt(void)
}
#endif /* CONFIG_DEBUG_OOM_SYSTEM_STATE */

+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+static inline u16 oom_kill_debug_slabs_tenthpercent(void)
+{
+ return oom_kill_debug_tenthpercent(SELECT_SLABS_STATE);
+}
+
+static void oom_kill_debug_slabs_and_summary_print(void)
+{
+ u16 pcttenth = oom_kill_debug_slabs_tenthpercent();
+ unsigned long minkb = (K(totalram_pages()) * pcttenth) / 1000;
+
+ slab_common_oom_debug_dump_unreclaimable(minkb);
+
+ pr_info("Slabs Total: %lukB Reclaim: %lukB Unreclaim: %lukB\n",
+ K((global_node_page_state(NR_SLAB_RECLAIMABLE) +
+ global_node_page_state(NR_SLAB_UNRECLAIMABLE))),
+ K(global_node_page_state(NR_SLAB_RECLAIMABLE)),
+ K(global_node_page_state(NR_SLAB_UNRECLAIMABLE)));
+}
+
+bool oom_kill_debug_unreclaimable_slabs_print(void)
+{
+ if (oom_kill_debug_enabled(SELECT_SLABS_STATE)) {
+ oom_kill_debug_slabs_and_summary_print();
+ return true;
+ }
+ return false;
+}
+#endif /* CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT */
+
#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
static void oom_kill_debug_tasks_summary_print(void)
{
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index 7288969db9ce..549b8da179d0 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -9,6 +9,10 @@
#ifndef __MM_OOM_KILL_DEBUG_H__
#define __MM_OOM_KILL_DEBUG_H__

+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+extern bool oom_kill_debug_unreclaimable_slabs_print(void);
+#endif
+
extern u32 oom_kill_debug_oom_event_is(void);
extern u32 oom_kill_debug_event(void);
extern bool oom_kill_debug_enabled(u16 index);
diff --git a/mm/slab.h b/mm/slab.h
index 9057b8056b07..703e914efedc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -586,10 +586,14 @@ int memcg_slab_show(struct seq_file *m, void *p);

#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
void dump_unreclaimable_slab(void);
+void slab_common_oom_debug_dump_unreclaimable(unsigned long minkb);
#else
static inline void dump_unreclaimable_slab(void)
{
}
+static inline void slab_common_oom_debug_dump_unreclaimable(unsigned long minkb)
+{
+}
#endif

void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807490fe217a..9ddc95040b60 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1450,6 +1450,71 @@ void dump_unreclaimable_slab(void)
mutex_unlock(&slab_mutex);
}

+#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
+static void oom_debug_slab_header_print(void)
+{
+ pr_info("Unreclaimable slab info:\n");
+ pr_info("Name Used Total\n");
+}
+
+static void oom_debug_slab_print(struct slabinfo *psi, struct kmem_cache *pkc)
+{
+ pr_info("%-17s %10luKB %10luKB\n", cache_name(pkc),
+ (psi->active_objs * pkc->size) / 1024,
+ (psi->num_objs * pkc->size) / 1024);
+}
+
+static bool oom_debug_slab_check(struct slabinfo *psi, struct kmem_cache *pkc,
+ unsigned long min_kb)
+{
+ if (psi->num_objs > 0) {
+ if (((psi->active_objs * pkc->size) / 1024) >= min_kb) {
+ oom_debug_slab_print(psi, pkc);
+ return true;
+ }
+ }
+ return false;
+}
+
+void slab_common_oom_debug_dump_unreclaimable(unsigned long minkb)
+{
+ struct kmem_cache *s, *s2;
+ struct slabinfo sinfo;
+ u32 slabs_examined = 0;
+ u32 slabs_printed = 0;
+
+ /*
+ * Here acquiring slab_mutex is risky since we don't prefer to get
+ * sleep in oom path. But, without mutex hold, it may introduce a
+ * risk of crash.
+ * Use mutex_trylock to protect the list traverse, dump nothing
+ * without acquiring the mutex.
+ */
+ if (!mutex_trylock(&slab_mutex)) {
+ pr_warn("excessive unreclaimable slab but cannot dump stats\n");
+ return;
+ }
+
+ oom_debug_slab_header_print();
+
+ list_for_each_entry_safe(s, s2, &slab_caches, list) {
+ if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
+ continue;
+
+ get_slabinfo(s, &sinfo);
+
+ ++slabs_examined;
+
+ if (oom_debug_slab_check(&sinfo, s, minkb))
+ ++slabs_printed;
+ }
+ mutex_unlock(&slab_mutex);
+
+ pr_info("Summary: Slab entries examined:%u printed:%u minsize:%lukB\n",
+ slabs_examined, slabs_printed, minkb);
+}
+#endif /* CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT */
+
#if defined(CONFIG_MEMCG)
void *memcg_slab_start(struct seq_file *m, loff_t *pos)
{
--
2.20.1

2019-08-26 21:49:10

by Edward Chron

[permalink] [raw]
Subject: [PATCH 06/10] mm/oom_debug: Add Select Vmalloc Entries Print

Add OOM Debug code to allow select vmalloc entries to be printed at
the time of an OOM event. Listing some portion of the larger vmalloc
entries has proven useful in tracking memory usage during an OOM event
so the root cause of the event can be determined.

Configuring this OOM Debug Option (DEBUG_OOM_VMALLOC_SELECT_PRINT)
------------------------------------------------------------------
To configure this option, select it in the OOM Debugging
configuration menu. The kernel configuration entry can be found in the
config at: Kernel hacking, Memory Debugging, OOM Debugging, as the
DEBUG_OOM_VMALLOC_SELECT_PRINT config entry.

Two dynamic OOM debug settings for this option: enable, tenthpercent
--------------------------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs entry prefix for this option is: vmalloc_select_print_
and for select options there are two debugfs files: the enabled file
and the tenthpercent file.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found
at: /sys/kernel/debug/oom/vmalloc_select_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.

Specifying the minimum entry size (0-1000) in the tenthpercent file
-------------------------------------------------------------------
Also for DEBUG_OOM_VMALLOC_SELECT_PRINT the number of vmalloc entries
printed can be adjusted. By default if the DEBUG_OOM_VMALLOC_SELECT_PRINT
config option is enabled only entries that use 1% or more of memory are
printed. This can be adjusted to be entries as small as 0% of memory
or as large as 100% of memory in which case only a summary line is
printed, as no vmalloc entry could possibly use 100% of memory.
Adjustments are made through the debugfs file found at:
/sys/kernel/debug/oom/vmalloc_select_print_tenthpercent
Entry values that are valid are 0 through 1000 which represent memory
usage of 0% of memory to 100% of memory. Only entries that are using
at least one page of memory are printed, even if the minimum entry
size is specified as 0; zero page entries have no memory assigned.
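
Per the rules above, an entry is printed only when it has at least one
page assigned and its size in kB, K(nr_pages), meets the minimum. A
userspace sketch of that selection check, assuming 4 KiB pages
(PAGE_SHIFT = 12):

```c
#define PAGE_SHIFT 12 /* assumed 4 KiB pages */
#define K(pages) ((pages) << (PAGE_SHIFT - 10)) /* pages -> kB */

/* Mirror of the per-entry test: only entries with at least one page
 * assigned and K(nr_pages) >= min_kb are printed. */
int vmalloc_entry_selected(unsigned long nr_pages, unsigned long min_kb)
{
	return nr_pages > 0 && K(nr_pages) >= min_kb;
}
```

The 640-page swapon entry in the sample output below is 2560 kB, so it
is printed for any minimum of 2560 kB or less.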

Content of Vmalloc entry records and Vmalloc summary record
-----------------------------------------------------------
The output is vmalloc entry information output limited such that only
entries equal to or larger than the minimum size are printed.
Unused vmallocs (no pages assigned to the vmalloc) are never printed.
The vmalloc entry information includes:
- Size (in bytes)
- pages (Number pages in use)
- Caller Information to identify the request

Additional output consists of summary information that is printed
at the end of the output. This summary information includes:
- Number of Vmalloc entries examined
- Number of Vmalloc entries printed
- minimum entry size for selection

Sample Output
-------------
Output produced consists of one line of output for each vmalloc entry
that is equal to or larger than the minimum entry size specified by
the tenthpercent setting (0% to 100.0%). A section header line and
one line of summary output are also printed.

Sample Vmalloc entries section header:

Aug 19 19:27:01 coronado kernel: Vmalloc Info:

Sample per entry selected print line output:

Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640
caller=__do_sys_swapon+0x78e/0x1130

Sample summary print line output:

Jul 22 19:03:26 yoursystem kernel: Summary: Vmalloc entries examined:1070
printed:989 minsize:0kB


Signed-off-by: Edward Chron <[email protected]>
---
include/linux/vmalloc.h | 12 ++++++++++++
mm/Kconfig.debug | 28 +++++++++++++++++++++++++++
mm/oom_kill_debug.c | 21 ++++++++++++++++++++
mm/vmalloc.c | 43 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 104 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 9b21d0047710..09e3257fc382 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -227,4 +227,16 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
int register_vmap_purge_notifier(struct notifier_block *nb);
int unregister_vmap_purge_notifier(struct notifier_block *nb);

+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+/*
+ * Routine used to print select vmalloc entries on an OOM event so we
+ * can identify sizeable entries that may have a significant effect on
+ * kernel memory utilization. Output goes to dmesg along with all the OOM
+ * related messages when the config option DEBUG_OOM_VMALLOC_SELECT_PRINT
+ * is set to yes. The option may be dynamically enabled or disabled and
+ * the selection size is also dynamically configurable.
+ */
+extern void vmallocinfo_oom_print(unsigned long min_kb);
+#endif /* CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT */
+
#endif /* _LINUX_VMALLOC_H */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index c7d53ca95d32..ea3465343286 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -219,3 +219,31 @@ config DEBUG_OOM_SLAB_SELECT_PRINT
print limit value of 10 or 1% of memory.

If unsure, say N.
+
+config DEBUG_OOM_VMALLOC_SELECT_PRINT
+ bool "Debug OOM Select Vmallocs Print"
+ depends on DEBUG_OOM
+ help
+ When enabled, limits which vmalloc entries are printed,
+ based on the amount of memory each vmalloc entry is
+ consuming.
+
+ If the option is configured it is enabled/disabled by setting
+ the value of the file entry in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/vmalloc_select_print_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ When enabled, the entries printed are limited by the amount of
+ memory they consume. The setting value defines the minimum
+ memory consumed and is expressed in tenths of a percent.
+ Values supported are 0 to 1000 where 0 allows all entries to be
+ printed, 1 would allow entries using 0.1% or more to be printed,
+ 10 would allow entries using 1% or more of memory to be printed.
+
+ If configured and enabled the rate limiting memory percentage
+ is specified by setting a value in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/vmalloc_select_print_tenthpercent
+ If configured the default settings are set to enabled and
+ print limit value of 10 or 1% of memory.
+
+ If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 2b5245e1134d..d5e37f8508e6 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -168,6 +168,9 @@
#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
#include "slab.h"
#endif
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+#include <linux/vmalloc.h>
+#endif

#define OOMD_MAX_FNAME 48
#define OOMD_MAX_OPTNAME 32
@@ -223,6 +226,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "slab_select_print_",
.support_tpercent = true,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+ {
+ .option_name = "vmalloc_select_print_",
+ .support_tpercent = true,
+ },
#endif
{}
};
@@ -243,6 +252,9 @@ enum oom_debug_options_index {
#endif
#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
SELECT_SLABS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+ SELECT_VMALLOC_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -431,6 +443,15 @@ u32 oom_kill_debug_oom_event_is(void)
neightbl_print_stats("nd_tbl", &nd_tbl);
#endif

+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+ if (oom_kill_debug_enabled(SELECT_VMALLOC_STATE)) {
+ u16 ptenth = oom_kill_debug_tenthpercent(SELECT_VMALLOC_STATE);
+ unsigned long minkb = (K(totalram_pages()) * ptenth) / 1000;
+
+ vmallocinfo_oom_print(minkb);
+ }
+#endif
+
#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
if (oom_kill_debug_enabled(TASKS_STATE))
oom_kill_debug_tasks_summary_print();
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7ba11e12a11f..2cdc0f0cd0af 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3523,4 +3523,47 @@ static int __init proc_vmalloc_init(void)
}
module_init(proc_vmalloc_init);

+#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
+#define K(x) ((x) << (PAGE_SHIFT-10))
+/*
+ * Routine used to print select vmalloc entries on an OOM condition so
+ * we can identify sizeable entries that may have a significant effect on
+ * kernel memory utilization. Output goes to dmesg along with all the OOM
+ * related messages when the config option DEBUG_OOM_VMALLOC_SELECT_PRINT
+ * is set to yes. Both enable / disable and size selection value are
+ * dynamically configurable.
+ */
+void vmallocinfo_oom_print(unsigned long min_kb)
+{
+ struct vmap_area *vap;
+ struct vm_struct *vsp;
+ u32 entries = 0;
+ u32 printed = 0;
+
+ if (!spin_trylock(&vmap_area_lock)) {
+ pr_info("Vmalloc Info: Skipped, vmap_area_lock not available\n");
+ return;
+ }
+
+ pr_info("Vmalloc Info:\n");
+ list_for_each_entry(vap, &vmap_area_list, list) {
+ if (!(vap->flags & VM_VM_AREA))
+ continue;
+ ++entries;
+ vsp = vap->vm;
+ if ((vsp->nr_pages > 0) && (K(vsp->nr_pages) >= min_kb)) {
+ pr_info("vmalloc size=%lu pages=%d caller=%pS\n",
+ vsp->size, vsp->nr_pages, vsp->caller);
+ ++printed;
+ }
+ }
+
+ spin_unlock(&vmap_area_lock);
+
+ pr_info("Summary: Vmalloc entries examined:%u printed:%u minsize:%lukB\n",
+ entries, printed, min_kb);
+}
+EXPORT_SYMBOL(vmallocinfo_oom_print);
+#endif /* CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT */
+
#endif
--
2.20.1

2019-08-26 21:49:27

by Edward Chron

[permalink] [raw]
Subject: [PATCH 07/10] mm/oom_debug: Add Select Process Entries Print

Add OOM Debug code to selectively print an entry for each user
process that was considered for OOM selection at the time of an
OOM event. Limiting the processes to print is done by specifying a
minimum amount of memory that must be used to be eligible to be
printed.

Note: memory usage for OOM candidate processes includes RAM as well
as swap space. The value totalpages is used as the total memory
size when determining the percentage of "memory" used.

Configuring this OOM Debug Option (DEBUG_OOM_PROCESS_SELECT_PRINT)
------------------------------------------------------------------
To configure this option, select it in the OOM Debugging
configuration menu. The kernel configuration entry for this option can
be found in the config at: Kernel hacking, Memory Debugging, OOM
Debugging, as the DEBUG_OOM_PROCESS_SELECT_PRINT config entry.

Two dynamic OOM debug settings for this option: enable, tenthpercent
--------------------------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs entry prefix for this option is: process_select_print_
and for select options there are two debugfs files: the enabled file
and the tenthpercent file.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found at:
/sys/kernel/debug/oom/process_select_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.

Specifying the minimum entry size (0-1000) in the tenthpercent file
-------------------------------------------------------------------
For DEBUG_OOM_PROCESS_SELECT_PRINT the processes printed can be limited
by specifying the minimum percentage of memory usage to be eligible to
be printed. By default if the DEBUG_OOM_PROCESS_SELECT_PRINT config
option is enabled only processes that use 1% or more of memory are
printed. This can be adjusted to be entries as small as 0.1% of memory
or as large as 100% of memory in which case only a summary line is
printed, as no process could possibly consume 100% of the memory.
Adjustments are made through the debugfs file found at:
/sys/kernel/debug/oom/process_select_print_tenthpercent
Valid values are 1 through 1000, which represent memory usage of 0.1%
of memory to 100% of totalpages. A value of zero is also valid and,
when specified, prints an entry for all OOM considered processes even
if they use essentially no memory.
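
The minimum is applied in pages (the patch adds a minpgs field to
struct oom_control), presumably computed as
minpgs = totalpages * tenthpercent / 1000, and reported in kB in the
summary line. A sketch of that presumed computation, assuming 4 KiB
pages:

```c
#define PAGE_SHIFT 12 /* assumed 4 KiB pages */
#define K(pages) ((pages) << (PAGE_SHIFT - 10)) /* pages -> kB */

/* Presumed minimum process size, in pages, for a given tenthpercent
 * setting (0-1000). */
unsigned long select_min_pages(unsigned long totalpages,
			       unsigned int tenthpercent)
{
	return (totalpages * tenthpercent) / 1000;
}
```

With the sample values below (total-pages:32579084kB, i.e. 8144771
pages at 4 KiB per page, and tenthpercent = 1) this yields 8144 pages,
i.e. K(8144) = 32576 kB, matching the 32576kB minimum in the sample
summary.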

Sample Output
-------------
Output produced consists of one line of standard Linux OOM process
entry output for each process that is equal to or larger than the
minimum entry size specified by the tenthpercent setting (0% to
100.0%), followed by one line of summary output.

Summary print line output (minsize = 0.1% of totalpages):

Aug 13 20:16:30 yourserver kernel: Summary: OOM Tasks considered:245
printed:33 minimum size:32576kB totalpages:32579084kB
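The kB figures in the summary line are page counts converted with the kernel's K() macro, pages << (PAGE_SHIFT - 10), as used in mm/oom_kill.c. Assuming 4 KiB pages, the sample numbers are self-consistent: a totalpages of 8144771 pages is 32579084 kB, and 0.1% of that, 8144 pages, is the 32576 kB minimum size shown. A sketch of the conversion:

```c
#define PAGE_SHIFT 12			/* assuming 4 KiB pages */
#define K(x) ((x) << (PAGE_SHIFT - 10))	/* pages -> kB, as in mm/oom_kill.c */

/* Convert a page count to kB the way the summary line does. */
static unsigned long pages_to_kb(unsigned long pages)
{
	return K(pages);
}
```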


Signed-off-by: Edward Chron <[email protected]>
---
include/linux/oom.h | 1 +
mm/Kconfig.debug | 34 ++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 39 +++++++++++++++++++++++++++++++--------
mm/oom_kill_debug.c | 22 ++++++++++++++++++++++
mm/oom_kill_debug.h | 3 +++
5 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c265f019..f37af4772452 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,6 +49,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+ unsigned long minpgs;

/* Used to print the constraint info. */
enum oom_constraint constraint;
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index ea3465343286..0c5feb0e15a9 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -247,3 +247,37 @@ config DEBUG_OOM_VMALLOC_SELECT_PRINT
print limit value of 10 or 1% of memory.

If unsure, say N.
+
+config DEBUG_OOM_PROCESS_SELECT_PRINT
+ bool "Debug OOM Select Process Print"
+ depends on DEBUG_OOM
+ help
+ When enabled, allows the OOM process information that is
+ printed to be limited based on the amount of memory each
+ considered process is consuming. The number of processes that
+ were considered for OOM selection, the number of processes
+ that were actually printed and the minimum memory usage
+ percentage used to select which processes are printed are
+ reported in a summary line after printing the
+ selected tasks.
+
+ If the option is configured it is enabled/disabled by setting
+ the value of the file entry in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/process_select_print_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ When enabled, entries are print limited by the amount of memory
+ they consume. The setting value defines the minimum memory
+ size consumed, expressed in tenths of a percent of totalpages.
+ Supported values are 0 to 1000, where 0 allows all OOM considered
+ processes to be printed, 1 allows entries using 0.1% or
+ more of memory to be printed, and 10 allows entries using 1% or
+ more of memory to be printed.
+
+ If configured and enabled, the print limit for OOM process
+ selection is specified by setting a value in the debugfs OOM
+ interface at: /sys/kernel/debug/oom/process_select_print_tenthpercent
+ If configured, the default settings are enabled with a
+ print limit value of 10, i.e. 1% of memory.
+
+ If unsure, say N.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9022297fa2ba..4b37318dce4f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -380,6 +380,7 @@ static void select_bad_process(struct oom_control *oc)

static int dump_task(struct task_struct *p, void *arg)
{
+ unsigned long rsspgs, swappgs, pgtbl;
struct oom_control *oc = arg;
struct task_struct *task;

@@ -400,17 +401,29 @@ static int dump_task(struct task_struct *p, void *arg)
return 0;
}

+ rsspgs = get_mm_rss(task->mm);
+ swappgs = get_mm_counter(task->mm, MM_SWAPENTS);
+ pgtbl = mm_pgtables_bytes(task->mm);
+
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+ if ((oc->minpgs > 0) &&
+ ((rsspgs + swappgs + pgtbl / PAGE_SIZE) < oc->minpgs)) {
+ task_unlock(task);
+ return 0;
+ }
+#endif
+
pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
- task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
- mm_pgtables_bytes(task->mm),
- get_mm_counter(task->mm, MM_SWAPENTS),
+ task->tgid, task->mm->total_vm, rsspgs, pgtbl, swappgs,
task->signal->oom_score_adj, task->comm);
task_unlock(task);

- return 0;
+ return 1;
}

+#define K(x) ((x) << (PAGE_SHIFT-10))
+
/**
* dump_tasks - dump current memory state of all system tasks
* @oc: pointer to struct oom_control
@@ -423,19 +436,31 @@ static int dump_task(struct task_struct *p, void *arg)
*/
static void dump_tasks(struct oom_control *oc)
{
+ u32 total = 0;
+ u32 prted = 0;
+
pr_info("Tasks state (memory values in pages):\n");
pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");

+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+ oc->minpgs = oom_kill_debug_min_task_pages(oc->totalpages);
+#endif
+
if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
else {
struct task_struct *p;

rcu_read_lock();
- for_each_process(p)
- dump_task(p, oc);
+ for_each_process(p) {
+ ++total;
+ prted += dump_task(p, oc);
+ }
rcu_read_unlock();
}
+
+ pr_info("Summary: OOM Tasks considered:%u printed:%u minimum size:%lukB totalpages:%lukB\n",
+ total, prted, K(oc->minpgs), K(oc->totalpages));
}

static void oom_kill_unreclaimable_slabs_print(void)
@@ -492,8 +517,6 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);

static bool oom_killer_disabled __read_mostly;

-#define K(x) ((x) << (PAGE_SHIFT-10))
-
/*
* task->mm can be NULL if the task is the exited group leader. So to
* determine whether the task is using a particular mm, we examine all the
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index d5e37f8508e6..66b745039771 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -232,6 +232,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "vmalloc_select_print_",
.support_tpercent = true,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+ {
+ .option_name = "process_select_print_",
+ .support_tpercent = true,
+ },
#endif
{}
};
@@ -255,6 +261,9 @@ enum oom_debug_options_index {
#endif
#ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
SELECT_VMALLOC_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+ SELECT_PROCESS_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -415,6 +424,19 @@ bool oom_kill_debug_unreclaimable_slabs_print(void)
}
#endif /* CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT */

+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages)
+{
+ u16 pct;
+
+ if (!oom_kill_debug_enabled(SELECT_PROCESS_STATE))
+ return 0;
+
+ pct = oom_kill_debug_tenthpercent(SELECT_PROCESS_STATE);
+ return (totalpages * pct) / 1000;
+}
+#endif /* CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT */
+
#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
static void oom_kill_debug_tasks_summary_print(void)
{
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index 549b8da179d0..7eec861a0009 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -9,6 +9,9 @@
#ifndef __MM_OOM_KILL_DEBUG_H__
#define __MM_OOM_KILL_DEBUG_H__

+#ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
+extern unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages);
+#endif
#ifdef CONFIG_DEBUG_OOM_SLAB_SELECT_PRINT
extern bool oom_kill_debug_unreclaimable_slabs_print(void);
#endif
--
2.20.1

2019-08-26 21:49:44

by Edward Chron

Subject: [PATCH 03/10] mm/oom_debug: Add Tasks Summary

Adds config option and code to support printing a Process / Thread Summary
of process / thread activity when an OOM event occurs. The information
provided includes the number of processes and threads active, the number
of oom eligible and oom ineligible tasks, the total number of forks
that have happened since the system booted and the number of runnable
and I/O blocked processes. All values are at the time of the OOM event.

Configuring this Debug Option (DEBUG_OOM_TASKS_SUMMARY)
-------------------------------------------------------
To get the tasks information summary this option must be configured.
The Tasks Summary option uses the CONFIG_DEBUG_OOM_TASKS_SUMMARY
kernel config option, which is found in the kernel config under:
Kernel hacking, Memory Debugging, OOM Debugging. The config option
to select is: DEBUG_OOM_TASKS_SUMMARY.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs for this option is: tasks_summary_
and there is just one file for this option, the enable file.

The option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this option is found at:
/sys/kernel/debug/oom/tasks_summary_enabled
The option's enabled file value determines whether the facility is enabled
or disabled. A value of 1 is enabled (default) and a value of 0 is
disabled. When configured the default setting is set to enabled.
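Reading the current state back is symmetrical to setting it. An illustrative userspace helper (not part of the patch; any readable path can stand in for /sys/kernel/debug/oom/tasks_summary_enabled when testing):

```c
#include <stdio.h>

/* Return the 0/1 state of an OOM debug "enabled" file, or -1 on error. */
static int oom_debug_get_enabled(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (!f)
		return -1;	/* debugfs not mounted or option not configured */
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return (val == 0 || val == 1) ? val : -1;
}
```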

Content and format of Tasks Summary Output
------------------------------------------
One line of output that includes:
- Number of Threads
- Number of processes
- Forks since boot
- Processes that are runnable
- Processes that are in iowait

Sample Output:
-------------
Sample Tasks Summary message output:

Aug 13 18:52:48 yoursystem kernel: Threads: 492 Processes: 248
forks_since_boot: 7786 procs_runable: 4 procs_iowait: 0
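For unattended devices that collect OOM logs automatically, the one-line summary is straightforward to parse. An illustrative parser (not part of the patch; the field names, including the procs_runable spelling, follow the sample output above):

```c
#include <stdio.h>

/* Parse a Tasks Summary line into its five counters; returns 0 on success. */
static int parse_tasks_summary(const char *line, int *threads, int *procs,
			       unsigned long *forks, unsigned long *runnable,
			       unsigned long *iowait)
{
	/* Whitespace in a scanf format matches zero or more whitespace
	 * characters, so this handles both "Threads: 492" (the sample
	 * above) and "Threads:492" (the raw printk format). */
	int n = sscanf(line,
		       "Threads: %d Processes: %d forks_since_boot: %lu "
		       "procs_runable: %lu procs_iowait: %lu",
		       threads, procs, forks, runnable, iowait);

	return n == 5 ? 0 : -1;
}
```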


Signed-off-by: Edward Chron <[email protected]>
---
mm/Kconfig.debug | 16 ++++++++++++++++
mm/oom_kill_debug.c | 27 +++++++++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index dbe599b67a3b..fcbc5f9aa146 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -147,3 +147,19 @@ config DEBUG_OOM_SYSTEM_STATE
A value of 1 is enabled (default) and a value of 0 is disabled.

If unsure, say N.
+
+config DEBUG_OOM_TASKS_SUMMARY
+ bool "Debug OOM System Tasks Summary"
+ depends on DEBUG_OOM
+ help
+ When enabled, provides a kernel process/thread summary recording
+ the system's process/thread activity at the time of an OOM event.
+ The number of processes and of threads, the number of runnable
+ and I/O blocked threads, the number of forks since boot and the
+ number of oom eligible and oom ineligible tasks are provided in
+ the output. If configured it is enabled/disabled by setting the
+ enabled file entry in the debugfs OOM interface at:
+ /sys/kernel/debug/oom/tasks_summary_enabled
+ A value of 1 is enabled (default) and a value of 0 is disabled.
+
+ If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 6eeaad86fca8..395b3307f822 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -152,6 +152,10 @@
#include <linux/sched/stat.h>
#endif

+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+#include <linux/sched/stat.h>
+#endif
+
#define OOMD_MAX_FNAME 48
#define OOMD_MAX_OPTNAME 32

@@ -182,6 +186,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
.option_name = "system_state_summary_",
.support_tpercent = false,
},
+#endif
+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+ {
+ .option_name = "tasks_summary_",
+ .support_tpercent = false,
+ },
#endif
{}
};
@@ -190,6 +200,9 @@ static struct oom_debug_option oom_debug_options_table[] = {
enum oom_debug_options_index {
#ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
SYSTEM_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+ TASKS_STATE,
#endif
OUT_OF_BOUNDS
};
@@ -320,6 +333,15 @@ static void oom_kill_debug_system_summary_prt(void)
}
#endif /* CONFIG_DEBUG_OOM_SYSTEM_STATE */

+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+static void oom_kill_debug_tasks_summary_print(void)
+{
+ pr_info("Threads:%d Processes:%d forks_since_boot:%lu procs_runable:%lu procs_iowait:%lu\n",
+ nr_threads, nr_processes(),
+ total_forks, nr_running(), nr_iowait());
+}
+#endif /* CONFIG_DEBUG_OOM_TASKS_SUMMARY */
+
u32 oom_kill_debug_oom_event_is(void)
{
++oom_kill_debug_oom_events;
@@ -329,6 +351,11 @@ u32 oom_kill_debug_oom_event_is(void)
oom_kill_debug_system_summary_prt();
#endif

+#ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
+ if (oom_kill_debug_enabled(TASKS_STATE))
+ oom_kill_debug_tasks_summary_print();
+#endif
+
return oom_kill_debug_oom_events;
}

--
2.20.1

2019-08-27 07:16:48

by Michal Hocko

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Mon 26-08-19 12:36:28, Edward Chron wrote:
[...]
> Extensibility using OOM debug options
> -------------------------------------
> What is needed is an extensible system to optionally configure
> debug options as needed and to then dynamically enable and disable
> them. Also for options that produce multiple lines of entry based
> output, to configure which entries to print based on how much
> memory they use (or optionally all the entries).

With a patch this large and adding a lot of new stuff we need a more
detailed usecases described I believe.

[...]

> Use of debugfs to allow dynamic controls
> ----------------------------------------
> By providing a debugfs interface that allows options to be configured,
> enabled and where appropriate to set a minimum size for selecting
> entries to print, the output produced when an OOM event occurs can be
> dynamically adjusted to produce as little or as much detail as needed
> for a given system.

Who is going to consume this information and why would that consumer be
unreasonable to demand further maintenance of that information in future
releases? In other words, debugfs is not considered a stable API, which is
OK here, but the side effect is that any change to these files results in
user visible behavior, and we consider that more or less stable as long as
there are consumers.

> OOM debug options can be added to the base code as needed.
>
> Currently we have the following OOM debug options defined:
>
> * System State Summary
> --------------------
> One line of output that includes:
> - Uptime (days, hour, minutes, seconds)

We do have timestamps in the log so why is this needed?

> - Number CPUs
> - Machine Type
> - Node name
> - Domain name

why are these needed? That is static information that doesn't really
influence the OOM situation.

> - Kernel Release
> - Kernel Version

part of the oom report
>
> Example output when configured and enabled:
>
> Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
>
> * Tasks Summary
> -------------
> One line of output that includes:
> - Number of Threads
> - Number of processes
> - Forks since boot
> - Processes that are runnable
> - Processes that are in iowait

We do have sysrq+t for this kind of information. Why do we need to
duplicate it?

> Example output when configured and enabled:
>
> Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
>
> * ARP Table and/or Neighbour Discovery Table Summary
> --------------------------------------------------
> One line of output each for ARP and ND that includes:
> - Table name
> - Table size (max # entries)
> - Key Length
> - Entry Size
> - Number of Entries
> - Last Flush (in seconds)
> - hash grows
> - entry allocations
> - entry destroys
> - Number lookups
> - Number of lookup hits
> - Resolution failures
> - Garbage Collection Forced Runs
> - Table Full
> - Proxy Queue Length
>
> Example output when configured and enabled (for both):
>
> ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199 resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0
>
> ... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0

Again, why is this needed particularly for the OOM event? I do
understand this might be useful system health diagnostic information but
how does this contribute to the OOM?

> * Add Select Slabs Print
> ----------------------
> Allow select slab entries (based on a minimum size) to be printed.
> Minimum size is specified as a percentage of the total RAM memory
> in tenths of a percent, consistent with existing OOM process scoring.
> Valid values are specified from 0 to 1000 where 0 prints all slab
> entries (all slabs that have at least one slab object in use) up
> to 1000 which would require a slab to use 100% of memory which can't
> happen so in that case only summary information is printed.
>
> The first line of output is the standard Linux output header for
> OOM printed Slab entries. This header looks like this:
>
> Aug 6 09:37:21 egc103 yourserver: Unreclaimable slab info:
>
> The output is existing slab entry memory usage limited such that only
> entries equal to or larger than the minimum size are printed.
> Empty slabs (no slab entries in slabs in use) are never printed.
>
> Additional output consists of summary information that is printed
> at the end of the output. This summary information includes:
> - # entries examined
> - # entries selected and printed
> - minimum entry size for selection
> - Slabs total size (kB)
> - Slabs reclaimable size (kB)
> - Slabs unreclaimable size (kB)
>
> Example Summary output when configured and enabled:
>
> Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB
>
> Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB

I am all for practical improvements for slab reporting. It is not really
trivial to find a good balance though. Printing all the caches simply
doesn't scale. So I would start by improving the current state rather
than adding more configurability.

>
> * Add Select Vmalloc allocations Print
> ------------------------------------
> Allow select vmalloc entries (based on a minimum size) to be printed.
> Minimum size is specified as a percentage of the total RAM memory
> in tenths of a percent, consistent with existing OOM process scoring.
> Valid values are specified from 0 to 1000 where 0 prints all vmalloc
> entries (all vmalloc allocations that have at least one page in use) up
> to 1000 which would require a vmalloc to use 100% of memory which can't
> happen so in that case only summary information is printed.
>
> The first line of output is a new Vmalloc output header for
> OOM printed Vmalloc entries. This header looks like this:
>
> Aug 19 19:27:01 yourserver kernel: Vmalloc Info:
>
> The output is vmalloc entry information output limited such that only
> entries equal to or larger than the minimum size are printed.
> Unused vmallocs (no pages assigned to the vmalloc) are never printed.
> The vmalloc entry information includes:
> - Size (in bytes)
> - pages (Number pages in use)
> - Caller Information to identify the request
>
> A sample vmalloc entry output looks like this:
>
> Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113
>
> Additional output consists of summary information that is printed
> at the end of the output. This summary information includes:
> - Number of Vmalloc entries examined
> - Number of Vmalloc entries printed
> - minimum entry size for selection
>
> A sample Vmalloc Summary output looks like this:
>
> Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB

This is a lot of information. I wouldn't be surprised if this alone
could easily overflow the ringbuffer. Besides that, it is rarely useful
for the OOM situation debugging. The overall size of the vmalloc area
is certainly interesting but I am not sure we have a handy counter to
cope with constrained OOM contexts.

> * Add Select Process Entries Print
> --------------------------------
> Allow select process entries (based on a minimum size) to be printed.
> Minimum size is specified as a percentage totalpages (RAM + swap)
> in tenths of a percent, consistent with existing OOM process scoring.
> Note: user process memory can be swapped out when swap space present
> so that is why swap space and ram memory comprise the totalpages
> used to calculate the percentage of memory a process is using.
> Valid values are specified from 0 to 1000 where 0 prints all user
> processes (that have valid mm sections and aren't exiting) up to
> 1000 which would require a user process to use 100% of memory which
> can't happen so in that case only summary information is printed.
>
> The first line of output is the standard Linux output headers for
> OOM printed User Processes. This header looks like this:
>
> Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages):
> Aug 19 19:27:01 yourserver kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
>
> The output is existing per user process data limited such that only
> entries equal to or larger than the minimum size are printed.
>
> Jul 21 20:07:48 yourserver kernel: [ 579] 0 579 7942 1010 90112 0 -1000 systemd-udevd
>
> Additional output consists of summary information that is printed
> at the end of the output. This summary information includes:
>
> Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB

This sounds like a good idea to limit the eligible process list size but
I am concerned that it might get misleading easily when there are many
small processes contributing to the OOM in the end.

> * Add Enhanced Process Print Information
> --------------------------------------
> Add OOM Debug code that prints additional detailed information about
> users processes that were considered for OOM killing for any print
> selected processes. The information is displayed for each user process
> that OOM prints in the output.
>
> This supplemental per user process information is very helpful for
> determining how process memory is used, allowing OOM event root cause
> identification that might not otherwise be possible.
>
> Output information for enhanced user process entrys printed includes:
> - pid
> - parent pid
> - ruid
> - euid
> - tgid
> - Process State (S)
> - utime in seconds
> - stime in seconds
> - oom_score_adjust
> - task comm value (name of process)
> - Vmem KiB
> - MaxRss KiB
> - CurRss KiB
> - Pte KiB
> - Swap KiB
> - Sock KiB
> - Lib KiB
> - Text KiB
> - Heap KiB
> - Stack KiB
> - File KiB
> - Shmem KiB
> - Read Pages
> - Fault Pages
> - Lock KiB
> - Pinned KiB

I can see some of these being interesting but I would rather pick up
those and add to the regular oom output rather than go over configuring
them.

> Configuring Patches:
> -------------------
> OOM Debug and any options you want to use must first be configured so
> the code is included in your kernel. This requires selecting kernel
> config file options. You will find config options to select under:
>
> Kernel hacking ---> Memory Debugging --->
>
> [*] Debug OOM
> [*] Debug OOM System State
> [*] Debug OOM System Tasks Summary
> [*] Debug OOM ARP Table
> [*] Debug OOM ND Table
> [*] Debug OOM Select Slabs Print
> [*] Debug OOM Slabs Select Always Print Enable
> [*] Debug OOM Enhanced Slab Print
> [*] Debug OOM Select Vmallocs Print
> [*] Debug OOM Select Process Print
> [*] Debug OOM Enhanced Process Print

I really dislike these though. We already have zillions of debugging
options and the config space is enormous. Different combinations of them
make any compile testing a challenge and eat a lot of cpu cycles.
Besides that, who is going to configure those in without using them
directly? Distributions are not going to enable them without having all
options disabled by default, for example.

> 12 files changed, 1339 insertions(+), 11 deletions(-)

This must have been a lot of work and I really appreciate that.

On the other hand it is a lot of code to maintain (note that you are
usually introspecting deep internals of subsystems so changes would
have to be carefully considered here as well) without a very strong
demand.

Sure it is a nice to have thing in some cases. I can imagine that some
of that information would have helped me when debugging some weird OOM
reports but I strongly suspect I would likely not have all necessary
pieces enabled because those were not reproducible. Having everything
on is just not usable due to the amount of data. printk is not free and
we have seen cases where a lot of output just turned the machine into
an unusable state. If you have reproducible OOMs then you can trigger
a panic and have the full state of the system to examine. So I am not
really convinced all this is going to be used to justify the maintenance
overhead.

All that being said, I do not think this is something we want to merge
without a really _strong_ usecase to back it.

Thanks!
--
Michal Hocko
SUSE Labs

2019-08-27 10:11:42

by Tetsuo Handa

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On 2019/08/27 16:15, Michal Hocko wrote:
> All that being said, I do not think this is something we want to merge
> without a really _strong_ usecase to back it.

Like the sender's domain "arista.com" suggests, some of the information is
geared towards networking devices, and the ability to report OOM information
in a way suitable for automatic recording/analyzing (e.g. without using
shell prompt, let alone manually typing SysRq commands) would be convenient
for unattended devices. We have only one OOM killer implementation and
format/data are hard-coded. If we can make OOM killer modular, Edward would
be able to use it.

2019-08-27 10:40:09

by Michal Hocko

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Tue 27-08-19 19:10:18, Tetsuo Handa wrote:
> On 2019/08/27 16:15, Michal Hocko wrote:
> > All that being said, I do not think this is something we want to merge
> > without a really _strong_ usecase to back it.
>
> Like the sender's domain "arista.com" suggests, some of the information is
> geared towards networking devices, and the ability to report OOM information
> in a way suitable for automatic recording/analyzing (e.g. without using
> shell prompt, let alone manually typing SysRq commands) would be convenient
> for unattended devices.

Why cannot the remote end of the logging identify the host? It has to
connect somewhere anyway, right? I also do assume that a log collector
already does store each log with host id of some form.

--
Michal Hocko
SUSE Labs

2019-08-27 12:43:03

by Qian Cai

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> This patch series provides code that works as a debug option through
> debugfs to provide additional controls to limit how much information
> gets printed when an OOM event occurs and or optionally print additional
> information about slab usage, vmalloc allocations, user process memory
> usage, the number of processes / tasks and some summary information
> about these tasks (number runable, i/o wait), system information
> (#CPUs, Kernel Version and other useful state of the system),
> ARP and ND Cache entry information.
>
> Linux OOM can optionally provide a lot of information, what's missing?
> ----------------------------------------------------------------------
> Linux provides a variety of detailed information when an OOM event occurs
> but has limited options to control how much output is produced. The
> system related information is produced unconditionally and limited per
> user process information is produced as a default enabled option. The
> per user process information may be disabled.
>
> Slab usage information was recently added and is output only if slab
> usage exceeds user memory usage.
>
> Many OOM events are due to user application memory usage sometimes in
> combination with the use of kernel resource usage that exceeds what is
> expected memory usage. Detailed information about how memory was being
> used when the event occurred may be required to identify the root cause
> of the OOM event.
>
> However, some environments are very large and printing all of the
> information about processes, slabs and or vmalloc allocations may
> not be feasible. For other environments printing as much information
> about these as possible may be needed to root cause OOM events.
>

For more in-depth analysis of OOM events, people could use kdump to save a
vmcore by setting "panic_on_oom", and then use the crash utility to analyze
the vmcore, which contains pretty much all the information you need.

The downside of that approach is that it is probably only suitable for
enterprise use-cases: kdump/crash tends to be tested properly on
enterprise-level distros, while the combo is more often broken for
developers on consumer distros, because kdump/crash can be affected by many
kernel subsystems and tends to break fairly quickly where community testing
is light.

2019-08-28 00:51:07

by Qian Cai

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information



> On Aug 27, 2019, at 8:23 PM, Edward Chron <[email protected]> wrote:
>
>
>
> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <[email protected]> wrote:
> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > This patch series provides code that works as a debug option through
> > debugfs to provide additional controls to limit how much information
> > gets printed when an OOM event occurs and or optionally print additional
> > information about slab usage, vmalloc allocations, user process memory
> > usage, the number of processes / tasks and some summary information
> > about these tasks (number runable, i/o wait), system information
> > (#CPUs, Kernel Version and other useful state of the system),
> > ARP and ND Cache entry information.
> >
> > Linux OOM can optionally provide a lot of information, what's missing?
> > ----------------------------------------------------------------------
> > Linux provides a variety of detailed information when an OOM event occurs
> > but has limited options to control how much output is produced. The
> > system related information is produced unconditionally and limited per
> > user process information is produced as a default enabled option. The
> > per user process information may be disabled.
> >
> > Slab usage information was recently added and is output only if slab
> > usage exceeds user memory usage.
> >
> > Many OOM events are due to user application memory usage sometimes in
> > combination with the use of kernel resource usage that exceeds what is
> > expected memory usage. Detailed information about how memory was being
> > used when the event occurred may be required to identify the root cause
> > of the OOM event.
> >
> > However, some environments are very large and printing all of the
> > information about processes, slabs and or vmalloc allocations may
> > not be feasible. For other environments printing as much information
> > about these as possible may be needed to root cause OOM events.
> >
>
> For more in-depth analysis of OOM events, people could use kdump to save a
> vmcore by setting "panic_on_oom", and then use the crash utility to analyze
> the vmcore, which contains pretty much all the information you need.
>
> Certainly, this is the ideal. A full system dump would give you the maximum amount of
> information.
>
> Unfortunately some environments may lack space to store the dump,

Kdump usually also supports dumping to a remote target via NFS, SSH, etc.

> let alone the time to dump the storage contents and restart the system. Some

There is also “makedumpfile”, which can compress and filter out unwanted
memory to reduce the vmcore size, and speed up the dumping process by using
multiple threads.

> systems can take many minutes to fully boot up, to reset and reinitialize all the
> devices. So unfortunately this is not always an option, and we need an OOM Report.

I am not sure why the fact that the system needs some minutes to reboot would
be relevant for the discussion here. The idea is to save a vmcore, and it can
be analyzed offline even on another system, as long as there is a matching
“vmlinux”.


2019-08-28 01:11:38

by Edward Chron

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 26-08-19 12:36:28, Edward Chron wrote:
> [...]
> > Extensibility using OOM debug options
> > -------------------------------------
> > What is needed is an extensible system to optionally configure
> > debug options as needed and to then dynamically enable and disable
> > them. Also for options that produce multiple lines of entry based
> > output, to configure which entries to print based on how much
> > memory they use (or optionally all the entries).
>
> With a patch this large and adding a lot of new stuff we need more
> detailed usecases described, I believe.

I guess it would make sense to explain the motivation for each OOM Debug
option I've sent separately.
I see there are comments on the patches; I will try to add more information there.

An overview would be that we've been collecting information on OOMs
over the last 12 years or so.
These are from switches, other embedded devices, and servers both large and small.
We ask for feedback on what information was helpful or could be helpful,
and we try to add it to make root causing issues easier.

These OOM debug options are some of the options we've created.
I didn't port all of them to 5.3, but these are representative.
Our latest kernel is a bit behind 5.3.

>
>
> [...]
>
> > Use of debugfs to allow dynamic controls
> > ----------------------------------------
> > By providing a debugfs interface that allows options to be configured,
> > enabled and where appropriate to set a minimum size for selecting
> > entries to print, the output produced when an OOM event occurs can be
> > dynamically adjusted to produce as little or as much detail as needed
> > for a given system.
>
> Who is going to consume this information and why would that consumer be
> unreasonable to demand further maintenance of that information in future
> releases? In other words debugfs is not considered a stable API, which is
> OK here, but the side effect of any change to these files results in user
> visible behavior, and we consider that more or less stable as long as
> there are consumers.
>
> > OOM debug options can be added to the base code as needed.
> >
> > Currently we have the following OOM debug options defined:
> >
> > * System State Summary
> > --------------------
> > One line of output that includes:
> > - Uptime (days, hour, minutes, seconds)
>
> We do have timestamps in the log so why is this needed?


Here is how an OOM report looks when we get one to look at:

Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
oom_score_adj=1000
Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
tainted 5.3.0-rc6+ #33
Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018

This shows the date and time, not the time of the last boot. The
/var/log/messages output is what we often have to look at, not raw
dmesg output.

>
>
> > - Number CPUs
> > - Machine Type
> > - Node name
> > - Domain name
>
> why are these needed? That is a static information that doesn't really
> influence the OOM situation.


Sorry if a few of the items overlap what OOM prints.
We've been printing a lot of this information since 2.6.38, and OOM
reporting has been updated since then.

We're updating our 4.19 system to have the latest OOM Report format.
This was the 5.0 patch "Reorg the OOM report in the dump header".
We are also back porting Shakeel's 5.3 patch to refactor dump_tasks
for memcg OOMs.
We're testing those back ports right now in fact.

We can probably get rid of some of the information we have, but I
haven't had a chance yet.
Hopefully we can do it as part of sending some code upstream.

>
>
> > - Kernel Release
> > - Kernel Version
>
> part of the oom report
>
> >
> > Example output when configured and enabled:
> >
> > Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
> >
> > * Tasks Summary
> > -------------
> > One line of output that includes:
> > - Number of Threads
> > - Number of processes
> > - Forks since boot
> > - Processes that are runnable
> > - Processes that are in iowait
>
> We do have sysrq+t for this kind of information. Why do we need to
> duplicate it?

Unfortunately, we can't log in to every customer system, or even every
system of our own, and do a sysrq+t after each OOM.
You could scan for OOMs and have a script do it, but doing a sysrq+t
after an OOM event you'll get different results.
I'd rather have the runnable and iowait counts during the OOM event, not after.
Computers are so darn fast; free up some memory and things can look a
lot different.

We've seen crond fork and hang and gradually create thousands of
processes, and all sorts of other unintended fork bombs.
On some systems we can't print all of the process information, as we've
discussed.
So we print a summary of how many there are total, and if you use the
select process print option you can print, for example, only the
processes that use more than 1% of memory. That may be a dozen or two
versus hundreds or thousands, which may make printing some user
processes, the largest memory users, feasible.

>
>
> > Example output when configured and enabled:
> >
> > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
> >
> > * ARP Table and/or Neighbour Discovery Table Summary
> > --------------------------------------------------
> > One line of output each for ARP and ND that includes:
> > - Table name
> > - Table size (max # entries)
> > - Key Length
> > - Entry Size
> > - Number of Entries
> > - Last Flush (in seconds)
> > - hash grows
> > - entry allocations
> > - entry destroys
> > - Number lookups
> > - Number of lookup hits
> > - Resolution failures
> > - Garbage Collection Forced Runs
> > - Table Full
> > - Proxy Queue Length
> >
> > Example output when configured and enabled (for both):
> >
> > ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199 resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0
> >
> > ... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0
>
> Again, why is this needed particularly for the OOM event? I do
> understand this might be useful system health diagnostic information but
> how does this contribute to the OOM?
>

It is an example of some system table information we print.
Other adjustable table information may be useful as well.
These table sizes are often adjustable, and collecting stats on usage
helps determine whether settings are appropriate.
The value during OOM events is very useful, as usage varies.
We also collect the same stats from user code periodically
and can compare them.

>
> > * Add Select Slabs Print
> > ----------------------
> > Allow select slab entries (based on a minimum size) to be printed.
> > Minimum size is specified as a percentage of the total RAM memory
> > in tenths of a percent, consistent with existing OOM process scoring.
> > Valid values are specified from 0 to 1000 where 0 prints all slab
> > entries (all slabs that have at least one slab object in use) up
> > to 1000 which would require a slab to use 100% of memory which can't
> > happen so in that case only summary information is printed.
> >
> > The first line of output is the standard Linux output header for
> > OOM printed Slab entries. This header looks like this:
> >
> > Aug 6 09:37:21 egc103 yourserver: Unreclaimable slab info:
> >
> > The output is existing slab entry memory usage limited such that only
> > entries equal to or larger than the minimum size are printed.
> > Empty slabs (no slab entries in slabs in use) are never printed.
> >
> > Additional output consists of summary information that is printed
> > at the end of the output. This summary information includes:
> > - # entries examined
> > - # entries selected and printed
> > - minimum entry size for selection
> > - Slabs total size (kB)
> > - Slabs reclaimable size (kB)
> > - Slabs unreclaimable size (kB)
> >
> > Example Summary output when configured and enabled:
> >
> > Jul 23 23:26:34 yoursystem kernel: Summary: Slab entries examined: 123 printed: 83 minsize: 0kB
> >
> > Jul 23 23:26:34 yoursystem kernel: Slabs Total: 151212kB Reclaim: 50632kB Unreclaim: 100580kB
>
> I am all for practical improvements for slab reporting. It is not really
> trivial to find a good balance though. Printing all the caches simply
> doesn't scale. So I would start by improving the current state rather
> than adding more configurability.


Yes, there is a challenge here, as with any choice of what information
to report when an OOM event occurs.
Paraphrasing: one size may not fit all.
To address this we tried to make it easy to add options and to allow
them to be enabled / disabled.
We'd rather rate limit based on memory usage than have the kernel
rate limit printing arbitrarily.
We had to make some choices on how to do this.

That said, we view the OOM report as debugging information.
So if you change the format, as long as we get the information we feel
is relevant, we're happy.
Since we print release and version information we can adjust our
scripts to handle format changes.
It's work, but not really that big a deal.
If you remove information that was useful that is a bit more painful,
but not the end of the world.
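To make the workflow concrete, enabling an option and setting its threshold through debugfs might look like the following; the directory and file names here are hypothetical placeholders for illustration, not the interface the patches actually create:

```shell
# Hypothetical debugfs knobs, names invented for illustration only
mount -t debugfs none /sys/kernel/debug 2>/dev/null

# enable the select-slabs print option
echo 1 > /sys/kernel/debug/oom_debug/slab_select_enabled

# print only slabs using at least 0.5% of RAM
# (value is in tenths of a percent, 0..1000)
echo 5 > /sys/kernel/debug/oom_debug/slab_select_minsize
```

The point is that the thresholds can be retuned at run time, without rebuilding or rebooting, as a system's workload changes.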

>
>
> >
> > * Add Select Vmalloc allocations Print
> > ------------------------------------
> > Allow select vmalloc entries (based on a minimum size) to be printed.
> > Minimum size is specified as a percentage of the total RAM memory
> > in tenths of a percent, consistent with existing OOM process scoring.
> > Valid values are specified from 0 to 1000 where 0 prints all vmalloc
> > entries (all vmalloc allocations that have at least one page in use) up
> > to 1000 which would require a vmalloc to use 100% of memory which can't
> > happen so in that case only summary information is printed.
> >
> > The first line of output is a new Vmalloc output header for
> > OOM printed Vmalloc entries. This header looks like this:
> >
> > Aug 19 19:27:01 yourserver kernel: Vmalloc Info:
> >
> > The output is vmalloc entry information output limited such that only
> > entries equal to or larger than the minimum size are printed.
> > Unused vmallocs (no pages assigned to the vmalloc) are never printed.
> > The vmalloc entry information includes:
> > - Size (in bytes)
> > - pages (Number pages in use)
> > - Caller Information to identify the request
> >
> > A sample vmalloc entry output looks like this:
> >
> > Jul 22 20:16:09 yoursystem kernel: Vmalloc size=2625536 pages=640 caller=__do_sys_swapon+0x78e/0x113
> >
> > Additional output consists of summary information that is printed
> > at the end of the output. This summary information includes:
> > - Number of Vmalloc entries examined
> > - Number of Vmalloc entries printed
> > - minimum entry size for selection
> >
> > A sample Vmalloc Summary output looks like this:
> >
> > Aug 19 19:27:01 coronado kernel: Summary: Vmalloc entries examined: 1070 printed: 989 minsize: 0kB
>
> This is a lot of information. I wouldn't be surprised if this alone
> could easily overflow the ringbuffer. Besides that, it is rarely useful
> for the OOM situation debugging. The overall size of the vmalloc area
> is certainly interesting but I am not sure we have a handy counter to
> cope with constrained OOM contexts.
>

We've had cases where just displaying very large allocations explained
why an OOM event occurred.
We size this so we rarely get much output here, an entry or two at most.
Again, it is optional, so if you don't care, don't enable it.

>
> > * Add Select Process Entries Print
> > --------------------------------
> > Allow select process entries (based on a minimum size) to be printed.
> > Minimum size is specified as a percentage totalpages (RAM + swap)
> > in tenths of a percent, consistent with existing OOM process scoring.
> > Note: user process memory can be swapped out when swap space present
> > so that is why swap space and ram memory comprise the totalpages
> > used to calculate the percentage of memory a process is using.
> > Valid values are specified from 0 to 1000 where 0 prints all user
> > processes (that have valid mm sections and aren't exiting) up to
> > 1000 which would require a user process to use 100% of memory which
> > can't happen so in that case only summary information is printed.
> >
> > The first line of output is the standard Linux output headers for
> > OOM printed User Processes. This header looks like this:
> >
> > Aug 19 19:27:01 yourserver kernel: Tasks state (memory values in pages):
> > Aug 19 19:27:01 yourserver kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> >
> > The output is existing per user process data limited such that only
> > entries equal to or larger than the minimum size are printed.
> >
> > Jul 21 20:07:48 yourserver kernel: [ 579] 0 579 7942 1010 90112 0 -1000 systemd-udevd
> >
> > Additional output consists of summary information that is printed
> > at the end of the output. This summary information includes:
> >
> > Aug 19 19:27:01 yourserver kernel: Summary: OOM Tasks considered:277 printed:143 minimum size:0kB totalpages:32791608kB
>
> This sounds like a good idea to limit the eligible process list size but
> I am concerned that it might get misleading easily when there are many
> small processes contributing to the OOM in the end.
>
> > * Add Enhanced Process Print Information
> > --------------------------------------
> > Add OOM Debug code that prints additional detailed information about
> > users processes that were considered for OOM killing for any print
> > selected processes. The information is displayed for each user process
> > that OOM prints in the output.
> >
> > This supplemental per user process information is very helpful for
> > determining how process memory is used to allow OOM event root cause
> > identification that might not otherwise be possible.
> >
> > Output information for enhanced user process entries printed includes:
> > - pid
> > - parent pid
> > - ruid
> > - euid
> > - tgid
> > - Process State (S)
> > - utime in seconds
> > - stime in seconds
> > - oom_score_adjust
> > - task comm value (name of process)
> > - Vmem KiB
> > - MaxRss KiB
> > - CurRss KiB
> > - Pte KiB
> > - Swap KiB
> > - Sock KiB
> > - Lib KiB
> > - Text KiB
> > - Heap KiB
> > - Stack KiB
> > - File KiB
> > - Shmem KiB
> > - Read Pages
> > - Fault Pages
> > - Lock KiB
> > - Pinned KiB
>
> I can see some of these being interesting but I would rather pick up
> those and add to the regular oom output rather than go over configuring
> them.
>

Would be glad to add these to standard OOM output.
One issue is that there are extra bytes of output with more detail.
So when constrained, to justify this we said we'd rather have lots of
detail on the top 50 or so memory consuming processes versus coarse
information on all user processes.
The task information will provide us counts of processes and measures
of process creation that are very useful.

>
> > Configuring Patches:
> > -------------------
> > OOM Debug and any options you want to use must first be configured so
> > the code is included in your kernel. This requires selecting kernel
> > config file options. You will find config options to select under:
> >
> > Kernel hacking ---> Memory Debugging --->
> >
> > [*] Debug OOM
> > [*] Debug OOM System State
> > [*] Debug OOM System Tasks Summary
> > [*] Debug OOM ARP Table
> > [*] Debug OOM ND Table
> > [*] Debug OOM Select Slabs Print
> > [*] Debug OOM Slabs Select Always Print Enable
> > [*] Debug OOM Enhanced Slab Print
> > [*] Debug OOM Select Vmallocs Print
> > [*] Debug OOM Select Process Print
> > [*] Debug OOM Enhanced Process Print
>
> I really dislike these though. We already have zillions of debugging
> options and the config space is enormous. Different combinations of them
> make any compile testing a challenge and a lot of cpu cycles eaten.
> Besides that, who is going to configure those in without using them
> directly? Distributions are not going to enable without having all
> options being disabled by default for example.
>

Oh I agree, I dislike configuration options; there are so many, and
when you upgrade you're left wondering what to do with the new ones.
That said, I understand their value when systems range from small
embedded devices up to supercomputers, with zillions of devices in
between.

I would be pleased to just have one configuration option, or better yet
just have the code be part of the standard system. So getting rid of
any or all of that would be a pleasure.
Quite honestly, we may argue for certain items, but in general we're
quite flexible.

> > 12 files changed, 1339 insertions(+), 11 deletions(-)
>
> This must have been a lot of work and I really appreciate that.
>
> On the other hand it is a lot of code to maintain (note that you are
> usually introspecting deep internals of subsystems so changes would
> have to be carefully considered here as well) without a very strong
> demand.
>
> Sure it is a nice to have thing in some cases. I can imagine that some
> of that information would have helped me when debugging some weird OOM
> reports but I strongly suspect I would likely not have all necessary
> pieces enabled because those were not reproducible. Having everything
> on is just not usable due to amount of data. printk is not free and
> we have seen cases where a lot of output just turned the machine into an
> unusable state. If you have reproducible OOMs then you can trigger
> a panic and have the full state of the system to examine. So I am not
> really convinced all this is going to be used to justify the maintenance
> overhead.


I can speak to many OOM events we have had to triage and root cause
over the past 7+ years that I've been involved with. It is quite true
that there is no single OOM report format that will allow every problem
to be completely root caused. The OOM report cannot provide all the
information a full dump provides. That said, the OOM report can give
you an excellent start on where to look when you otherwise aren't sure
where to look. With luck everything you need is in the OOM report and
you root cause right there.

I can give you all sorts of examples of this.
They're all anecdotal, but I would expect that admin and support people
in data centers see much the same sorts of issues. Would welcome input
from others too. Different environments certainly can vary.

On the issue of reproducible OOMs versus non-reproducible, that is
important to consider:

First, many OOMs we look at happen in the data center and they are not
easily reproducible. The analogy I use is that we spend a lot of time
having to drive by our tail lights. That is, we do a postmortem with
limited information after the fact. Why? We don't have the time or
luxury to turn on panic on OOM and let the system reboot. In fact we
very often have neither the time it takes to dump the system nor the
storage space to hold a full system dump, a shame as it is the best
scenario for sure. If a switch locks up for a few seconds the routing
protocols can time out, and that can start a reconfiguration chain
reaction in your data center that will not be well received.

If we could take a full system dump every time we need to capture the
state of the system, you wouldn't need an OOM report. In fact, where
else in the kernel does the kernel produce a report? OOM events are an
odd beast that for some systems are just an annoyance and on other
systems can be quite painful.

If you're lucky you can ignore the fact that OOM killed one of your
tabs in the Chrome browser.
You're not so lucky if a key process gets OOM killed, causing a cascade
of issues. The more pain you feel, the more motivated you become to try
and avoid future events.

We're not touching situations where OOM events occur in clusters or
periodically due to a persisting issue, or lots of other OOM dramas
that occur from time to time. For people who are unlucky and have to
care about OOM events, you often can't reproduce these, and you want to
capture as much information as is reasonable so you can work out what
the cause was, with the hope that you can prevent future events.

How much information is reasonable, and what information you want to
record, may vary.

> All that being said, I do not think this is something we want to merge
> without a really _strong_ usecase to back it.
>

I will supply any information that I can. Let me know the specifics of
what you need.
I guess I can try to explain a justification for each option I sent,
and we can have a dialog as needed.
That is at least a starting point.

I was hoping that posting this code and starting a discussion might
draw in both experts and others with an interest in the information
that is produced for an OOM event.

Our experience is that some additional information, and the ability to
adjust what is produced, is valuable.
We don't add new options all the time, but making it easy to do so is helpful.

It would be nice if everything was standard output, but even optional
configurable information is better than none.
We can continue to mod our kernel, but if others would benefit, we're
happy to contribute to the best of our abilities. We're flexible enough
to make any recommended improvements as well.

Also, our implementation, though we've been using it for some years
and it continues to evolve, is a reference implementation. Since the
output is debugging information, and we identify what system release
and version produces the output with each event, we can adjust our
scripts to deal with output changes as the system evolves. This is
expected as systems and Linux continue to evolve and improve.

We'd be happy to work with you and your colleagues to contribute any
improvements that you can accept, to help improve the OOM Report
output.

Thank-you again for your time and consideration!

Edward Chron
Arista Networks

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

2019-08-28 01:14:41

by Edward Chron

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <[email protected]> wrote:
>
>
>
> > On Aug 27, 2019, at 8:23 PM, Edward Chron <[email protected]> wrote:
> >
> >
> >
> > On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <[email protected]> wrote:
> > On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > > [...]
> >
> > For more in-depth analysis of OOM events, people could use kdump to save a
> > vmcore by setting "panic_on_oom", and then use the crash utility to analyze the
> > vmcore, which contains pretty much all the information you need.
> >
> > Certainly, this is the ideal. A full system dump would give you the maximum amount of
> > information.
> >
> > Unfortunately some environments may lack space to store the dump,
>
> Kdump usually also supports dumping to a remote target via NFS, SSH, etc.
>
> > let alone the time to dump the storage contents and restart the system. Some
>
> There is also “makedumpfile”, which can compress and filter out unwanted memory to reduce
> the vmcore size and speed up the dumping process by utilizing multiple threads.
>
> > systems can take many minutes to fully boot up, to reset and reinitialize all the
> > devices. So unfortunately this is not always an option, and we need an OOM Report.
>
> I am not sure how the fact that a system needs some minutes to reboot is relevant to the
> discussion here. The idea is to save a vmcore, and it can be analyzed offline, even on
> another system, as long as that system has a matching "vmlinux".
>
>

If selecting a dump on an OOM event doesn't reboot the system, and if
it runs fast enough that it doesn't slow processing enough to
appreciably affect the system's responsiveness, then it would be an
ideal solution. For some it would be overkill, but since it is an
option it is a choice to consider or not.

2019-08-28 01:33:49

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information



> On Aug 27, 2019, at 9:13 PM, Edward Chron <[email protected]> wrote:
>
> On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <[email protected]> wrote:
>> [...]
>
> If selecting a dump on an OOM event doesn't reboot the system, and if
> it runs fast enough that it doesn't slow processing enough to
> appreciably affect the system's responsiveness, then it would be an
> ideal solution. For some it would be overkill, but since it is an
> option it is a choice to consider or not.

It sounds like you are looking for something more like this,

https://github.com/iovisor/bcc/blob/master/tools/oomkill.py

2019-08-28 02:48:40

by Edward Chron

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Tue, Aug 27, 2019 at 6:32 PM Qian Cai <[email protected]> wrote:
>
>
>
> > On Aug 27, 2019, at 9:13 PM, Edward Chron <[email protected]> wrote:
> >
> > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <[email protected]> wrote:
> >> [...]
> >
> > If selecting a dump on an OOM event doesn't reboot the system and if
> > it runs fast enough such
> > that it doesn't slow processing enough to appreciably effect the
> > system's responsiveness then
> > then it would be ideal solution. For some it would be over kill but
> > since it is an option it is a
> > choice to consider or not.
>
> It sounds like you are looking for more of this,

If you want to supplement the OOM Report and keep the information together,
then you could use eBPF to do that. If that really is the preference, it
might make sense to implement the entire report as an eBPF script; then you
can modify the script however you choose. That would be very flexible: you
can change your configuration on the fly. As long as it has access to
everything you need, it should work.

Michal would know what direction OOM is headed in and whether he thinks this
fits with where things are going.

I'm flexible in the sense that I could change our submission to make
specific updates to the existing OOM code. We kept it as separate as
possible for ease of porting. But if we can build an acceptable case for
making updates to the existing OOM Report code, that works too.

Our current implementation has some knobs that allow limited scaling, which
has advantages over print rate limiting; it may let environments that
previously couldn't afford to print process, slab, or vmalloc entries do so
without generating a lot of output.

But the existing code could be modified to do the same thing, possibly
without a configuration interface if that is not desirable. For example, it
could look at the number of entries it would potentially print: if the
number is small it could print them all, otherwise it could scale the
selection based on a default memory-usage threshold. Do you really care
about slab or vmalloc entries using 1 MB or less of memory on a 256 GB
system, for example? Probably not. Our approach lets you size this and has a
default that may be reasonable for many environments, but the
configurability adds some complexity.
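The scaling idea above can be sketched in a few lines. This is a rough
illustration only; the function name, default thresholds, and sample data
are made up here and are not taken from the patch series:

```python
# Sketch of the size-based entry selection described above; the
# function name and thresholds are illustrative, not from the patches.

def select_entries(entries, total_mem_kb, max_full=32, min_share=1e-5):
    """entries: list of (name, size_kb) tuples.

    Print all entries when the list is short; otherwise keep only
    entries using at least `min_share` of total memory (about 2.6 MB
    on a 256 GB system with the default), largest first.
    """
    if len(entries) <= max_full:
        return sorted(entries, key=lambda e: e[1], reverse=True)
    cutoff = total_mem_kb * min_share
    return sorted((e for e in entries if e[1] >= cutoff),
                  key=lambda e: e[1], reverse=True)

# A 256 GB system is 268435456 kB; with max_full forced low for the
# demo, the 800 kB cache falls below the ~2684 kB cutoff and is dropped.
slabs = [("kmalloc-64", 130000), ("dentry", 52000), ("tiny-cache", 800)]
print(select_entries(slabs, 268435456, max_full=2))
# -> [('kmalloc-64', 130000), ('dentry', 52000)]
```

The same cutoff logic could live in the kernel's print loop with no
configuration interface at all, using only a built-in default share.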

Now you could in theory produce the entire OOM Report, plus anything we've
proposed, with an eBPF script. We haven't done it, but assume it works with
5.3. The problem with any type of plugin and/or configurable option is
testing, as Michal mentions, and the fact that it may or may not be present.

For production systems, installing and updating eBPF scripts may someday be
very common, but I wonder how data center managers feel about it now.
Developers are very excited about it and it is a very powerful tool, but can
I get permission to add or replace an eBPF script on production systems? If
there is reluctance for security, reliability, or any other reason, then I
would rather have the code in the kernel so I know it is there and is
tested. Just as I would prefer not to have the config options for the
reasons Michal cites, but I'll take that if it is the best I can get.

Will be interested to hear what Michal advises.

>
> https://github.com/iovisor/bcc/blob/master/tools/oomkill.py
>

2019-08-28 07:02:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Tue 27-08-19 18:07:54, Edward Chron wrote:
> On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 26-08-19 12:36:28, Edward Chron wrote:
> > [...]
> > > Extensibility using OOM debug options
> > > -------------------------------------
> > > What is needed is an extensible system to optionally configure
> > > debug options as needed and to then dynamically enable and disable
> > > them. Also for options that produce multiple lines of entry based
> > > output, to configure which entries to print based on how much
> > > memory they use (or optionally all the entries).
> >
> > With a patch this large and adding a lot of new stuff we need a more
> > detailed usecases described I believe.
>
> I guess it would make sense to explain motivation for each OOM Debug
> option I've sent separately.
> I see there comments on the patches I will try and add more information there.
>
> An overview would be that we've been collecting information on OOM's
> over the last 12 years or so.
> These are from switches, other embedded devices, servers both large and small.
> We ask for feedback on what information was helpful or could be helpful.
> We try and add it to make root causing issues easier.
>
> These OOM debug options are some of the options we've created.
> I didn't port all of them to 5.3 but these are representative.
> Our latest is kernel is a bit behind 5.3.
>
> >
> >
> > [...]
> >
> > > Use of debugfs to allow dynamic controls
> > > ----------------------------------------
> > > By providing a debugfs interface that allows options to be configured,
> > > enabled and where appropriate to set a minimum size for selecting
> > > entries to print, the output produced when an OOM event occurs can be
> > > dynamically adjusted to produce as little or as much detail as needed
> > > for a given system.
> >
> > Who is going to consume this information and why would that consumer be
> > unreasonable to demand further maintenance of that information in future
> > releases? In other words debugfs is not considered a stableAPI which is
> > OK here but the side effect of any change to these files results in user
> > visible behavior and we consider that more or less a stable as long as
> > there are consumers.
> >
> > > OOM debug options can be added to the base code as needed.
> > >
> > > Currently we have the following OOM debug options defined:
> > >
> > > * System State Summary
> > > --------------------
> > > One line of output that includes:
> > > - Uptime (days, hour, minutes, seconds)
> >
> > We do have timestamps in the log so why is this needed?
>
>
> Here is how an OOM report looks when we get it to look at:
>
> Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
> gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
> oom_score_adj=1000
> Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
> tainted 5.3.0-rc6+ #33
> Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
> IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018
>
> This shows the date and time, not time of the last boot. The
> /var/log/messages output is what we often have to look at not raw
> dmesgs.

This looks more like a configuration of the logging than a kernel
problem. Kernel does provide timestamps for logs. E.g.
$ tail -n1 /var/log/kern.log
Aug 28 08:27:46 tiehlicka kernel: <1054>[336340.954345] systemd-udevd[7971]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
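The bracketed `[336340.954345]` field in that log line is seconds since
boot, so uptime at the moment of an OOM is recoverable from the log itself.
A minimal sketch of the conversion; the boot time used below is made up for
illustration (in practice it would come from e.g. `uptime -s`):

```python
# Sketch: recovering wall-clock time (or, inversely, uptime) from the
# kernel's monotonic log timestamp, given the system boot time.
from datetime import datetime, timedelta

def log_stamp_to_wallclock(boot_time, monotonic_secs):
    # The bracketed dmesg timestamp is seconds since boot, so adding
    # it to the boot time yields the wall-clock time of the message.
    return boot_time + timedelta(seconds=monotonic_secs)

boot = datetime(2019, 8, 24, 11, 2, 26)  # hypothetical boot time
print(log_stamp_to_wallclock(boot, 336340.954345))
# -> 2019-08-28 08:28:06.954345
```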

[...]
> > > Example output when configured and enabled:
> > >
> > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
> > >
> > > * ARP Table and/or Neighbour Discovery Table Summary
> > > --------------------------------------------------
> > > One line of output each for ARP and ND that includes:
> > > - Table name
> > > - Table size (max # entries)
> > > - Key Length
> > > - Entry Size
> > > - Number of Entries
> > > - Last Flush (in seconds)
> > > - hash grows
> > > - entry allocations
> > > - entry destroys
> > > - Number lookups
> > > - Number of lookup hits
> > > - Resolution failures
> > > - Garbage Collection Forced Runs
> > > - Table Full
> > > - Proxy Queue Length
> > >
> > > Example output when configured and enabled (for both):
> > >
> > > ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 lookups: 204 hits: 199 resFailed: 38 gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0
> > >
> > > ... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: 368 entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 destroys: 1 lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0
> >
> > Again, why is this needed particularly for the OOM event? I do
> > understand this might be useful system health diagnostic information but
> > how does this contribute to the OOM?
> >
>
> It is example of some system table information we print.
> Other adjustable table information may be useful as well.
> These table sizes are often adjustable and collecting stats on usage
> helps determine if settings are appropriate.
> The value during OOM events is very useful as usage varies.
> We also collect the same stats like this from user code periodically
> and can compare these.
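A summary line like the ones quoted above is straightforward to consume
programmatically, which is what makes the comparison against periodic
userspace samples practical. A rough Python sketch; the field names are
taken from the example output, but the parsing approach is illustrative:

```python
# Sketch of parsing the neighbour-table summary line quoted above so
# the kernel's OOM-time stats can be compared against samples taken
# periodically from userspace.
import re

LINE = ("neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 "
        "entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 "
        "lookups: 204 hits: 199 resFailed: 38 gcRuns/Forced: 111 / 0 "
        "tblFull: 0 proxyQlen: 0")

def parse_neigh_summary(line):
    # Table name is the one non-numeric field; everything else is
    # a "key: integer" pair.
    name = re.search(r"Table: (\w+)", line).group(1)
    stats = {k: int(v) for k, v in re.findall(r"(\w+): (\d+)", line)}
    return name, stats

name, stats = parse_neigh_summary(LINE)
# e.g. flag tables that are nearly full or have a poor lookup hit rate
print(name, stats["entries"] / stats["size"], stats["hits"] / stats["lookups"])
```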

I suspect that this is a very narrow usecase and there are more like
that and I can imagine somebody with a different workload could come up
with yet another set of useful information to print. The more I think of these
additional modules the more I am convinced that this "plugin" architecture
is a wrong approach. Why? Mostly because all the code maintenance burden
is likely to be not worth all the niche usecase. This all has to be more
dynamic and ideally scriptable so that the code in the kernel just
provides the basic information and everybody can just hook in there and
dump whatever additional information is needed. Sounds like something
that eBPF could fit in, no? Have you considered that?

[...]

Skipping over many useful stuff. I can reassure you that my experience
with OOM debugging has been a real pain at times (e.g. when there is
simply no way to find out who has eaten all the memory because it is not
accounted anywhere) as well and I completely understand where you are
coming from. There is definitely a room for improvements we just have to
find a way how to get there.

Thanks!
--
Michal Hocko
SUSE Labs

2019-08-28 07:09:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Tue 27-08-19 19:47:22, Edward Chron wrote:
> On Tue, Aug 27, 2019 at 6:32 PM Qian Cai <[email protected]> wrote:
> >
> >
> >
> > > On Aug 27, 2019, at 9:13 PM, Edward Chron <[email protected]> wrote:
> > >
> > > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >>> On Aug 27, 2019, at 8:23 PM, Edward Chron <[email protected]> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai <[email protected]> wrote:
> > >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > >>>> This patch series provides code that works as a debug option through
> > >>>> debugfs to provide additional controls to limit how much information
> > >>>> gets printed when an OOM event occurs and or optionally print additional
> > >>>> information about slab usage, vmalloc allocations, user process memory
> > >>>> usage, the number of processes / tasks and some summary information
> > >>>> about these tasks (number runable, i/o wait), system information
> > >>>> (#CPUs, Kernel Version and other useful state of the system),
> > >>>> ARP and ND Cache entry information.
> > >>>>
> > >>>> Linux OOM can optionally provide a lot of information, what's missing?
> > >>>> ----------------------------------------------------------------------
> > >>>> Linux provides a variety of detailed information when an OOM event occurs
> > >>>> but has limited options to control how much output is produced. The
> > >>>> system related information is produced unconditionally and limited per
> > >>>> user process information is produced as a default enabled option. The
> > >>>> per user process information may be disabled.
> > >>>>
> > >>>> Slab usage information was recently added and is output only if slab
> > >>>> usage exceeds user memory usage.
> > >>>>
> > >>>> Many OOM events are due to user application memory usage sometimes in
> > >>>> combination with the use of kernel resource usage that exceeds what is
> > >>>> expected memory usage. Detailed information about how memory was being
> > >>>> used when the event occurred may be required to identify the root cause
> > >>>> of the OOM event.
> > >>>>
> > >>>> However, some environments are very large and printing all of the
> > >>>> information about processes, slabs and or vmalloc allocations may
> > >>>> not be feasible. For other environments printing as much information
> > >>>> about these as possible may be needed to root cause OOM events.
> > >>>>
> > >>>
> > >>> For more in-depth analysis of OOM events, people could use kdump to save a
> > >>> vmcore by setting "panic_on_oom", and then use the crash utility to analysis the
> > >>> vmcore which contains pretty much all the information you need.
> > >>>
> > >>> Certainly, this is the ideal. A full system dump would give you the maximum amount of
> > >>> information.
> > >>>
> > >>> Unfortunately some environments may lack space to store the dump,
> > >>
> > >> Kdump usually also support dumping to a remote target via NFS, SSH etc
> > >>
> > >>> let alone the time to dump the storage contents and restart the system. Some
> > >>
> > >> There is also “makedumpfile” that could compress and filter unwanted memory to reduce
> > >> the vmcore size and speed up the dumping process by utilizing multi-threads.
> > >>
> > >>> systems can take many minutes to fully boot up, to reset and reinitialize all the
> > >>> devices. So unfortunately this is not always an option, and we need an OOM Report.
> > >>
> > >> I am not sure how the system needs some minutes to reboot would be relevant for the
> > >> discussion here. The idea is to save a vmcore and it can be analyzed offline even on
> > >> another system as long as it having a matching “vmlinux.".
> > >>
> > >>
> > >
> > > If selecting a dump on an OOM event doesn't reboot the system and if
> > > it runs fast enough such
> > > that it doesn't slow processing enough to appreciably effect the
> > > system's responsiveness then
> > > then it would be ideal solution. For some it would be over kill but
> > > since it is an option it is a
> > > choice to consider or not.
> >
> > It sounds like you are looking for more of this,
>
> If you want to supplement the OOM Report and keep the information
> together than you could use EBPF to do that. If that really is the
> preference it might make sense to put the entire report as an EBPF
> script than you can modify the script however you choose. That would
> be very flexible. You can change your configuration on the fly. As
> long as it has access to everything you need it should work.
>
> Michal would know what direction OOM is headed and if he thinks that fits with
> where things are headed.

It seems we have landed in the similar thinking here. As mentioned in my
earlier email in this thread I can see the extensibility to be achieved
by eBPF. Essentially we would have a base form of the oom report like
now and scripts would then hook in there to provide whatever a specific
usecase needs. My practical experience with eBPF is close to zero so I
have no idea how that would actually work out though.

[...]
> For production systems installing and updating EBPF scripts may someday
> be very common, but I wonder how data center managers feel about it now?
> Developers are very excited about it and it is a very powerful tool but can I
> get permission to add or replace an existing EBPF on production systems?

I am not sure I understand. There must be somebody trusted to take care
of systems, right?
--
Michal Hocko
SUSE Labs

2019-08-28 10:15:28

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On 2019/08/28 16:08, Michal Hocko wrote:
> On Tue 27-08-19 19:47:22, Edward Chron wrote:
>> For production systems installing and updating EBPF scripts may someday
>> be very common, but I wonder how data center managers feel about it now?
>> Developers are very excited about it and it is a very powerful tool but can I
>> get permission to add or replace an existing EBPF on production systems?
>
> I am not sure I understand. There must be somebody trusted to take care
> of systems, right?
>

Speaking of my cases, those who take care of their systems are not
developers, and they are afraid of changing code that runs in kernel mode.
They are unlikely to give permission to install SystemTap/eBPF scripts. As a
result, in many cases, the root cause cannot be identified.

Moreover, we are talking about OOM situations, where we can't expect userspace
processes to work properly. We need to dump information we want, without
counting on userspace processes, before sending SIGKILL.

2019-08-28 10:34:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed 28-08-19 19:12:41, Tetsuo Handa wrote:
> On 2019/08/28 16:08, Michal Hocko wrote:
> > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> >> For production systems installing and updating EBPF scripts may someday
> >> be very common, but I wonder how data center managers feel about it now?
> >> Developers are very excited about it and it is a very powerful tool but can I
> >> get permission to add or replace an existing EBPF on production systems?
> >
> > I am not sure I understand. There must be somebody trusted to take care
> > of systems, right?
> >
>
> Speak of my cases, those who take care of their systems are not developers.
> And they afraid changing code that runs in kernel mode. They unlikely give
> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> the root cause cannot be identified.

Which is something I would call a process problem more than a kernel
one. Really if you need to debug a problem you really have to trust
those who can debug that for you. We are not going to take tons of code
to the kernel just because somebody is afraid to run a diagnostic.

> Moreover, we are talking about OOM situations, where we can't expect userspace
> processes to work properly. We need to dump information we want, without
> counting on userspace processes, before sending SIGKILL.

Yes, this is an inherent assumption I was making and that means that
whatever dynamic hooks would have to be registered in advance.

--
Michal Hocko
SUSE Labs

2019-08-28 10:58:37

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On 2019/08/28 19:32, Michal Hocko wrote:
>> Speak of my cases, those who take care of their systems are not developers.
>> And they afraid changing code that runs in kernel mode. They unlikely give
>> permission to install SystemTap/eBPF scripts. As a result, in many cases,
>> the root cause cannot be identified.
>
> Which is something I would call a process problem more than a kernel
> one. Really if you need to debug a problem you really have to trust
> those who can debug that for you. We are not going to take tons of code
> to the kernel just because somebody is afraid to run a diagnostic.
>

This is a problem of kernel development process.

>> Moreover, we are talking about OOM situations, where we can't expect userspace
>> processes to work properly. We need to dump information we want, without
>> counting on userspace processes, before sending SIGKILL.
>
> Yes, this is an inherent assumption I was making and that means that
> whatever dynamic hooks would have to be registered in advance.
>

No. I'm saying that neither static hooks nor dynamic hooks can work as
expected if they count on userspace processes. Registering in advance is
irrelevant. Whether it can work without userspace processes is relevant.

Also, out-of-tree code tends to become defunct. We are trying to debug
problems caused by in-tree code. Breaking out-of-tree debugging code just
because in-tree code developers don't want to pay the burden of maintaining
code for debugging problems caused by in-tree code is a very bad idea.

2019-08-28 11:14:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed 28-08-19 19:56:58, Tetsuo Handa wrote:
> On 2019/08/28 19:32, Michal Hocko wrote:
> >> Speak of my cases, those who take care of their systems are not developers.
> >> And they afraid changing code that runs in kernel mode. They unlikely give
> >> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> >> the root cause cannot be identified.
> >
> > Which is something I would call a process problem more than a kernel
> > one. Really if you need to debug a problem you really have to trust
> > those who can debug that for you. We are not going to take tons of code
> > to the kernel just because somebody is afraid to run a diagnostic.
> >
>
> This is a problem of kernel development process.

I disagree. Expecting that any larger project can be filled with the
(close to) _full_ and ready to use introspection built in is just
insane. We are trying to help with a generally useful information but
you simply cannot cover most existing failure paths.

> >> Moreover, we are talking about OOM situations, where we can't expect userspace
> >> processes to work properly. We need to dump information we want, without
> >> counting on userspace processes, before sending SIGKILL.
> >
> > Yes, this is an inherent assumption I was making and that means that
> > whatever dynamic hooks would have to be registered in advance.
> >
>
> No. I'm saying that neither static hooks nor dynamic hooks can work as
> expected if they count on userspace processes. Registering in advance is
> irrelevant. Whether it can work without userspace processes is relevant.

I am not saying otherwise. I do not expect any userspace process to dump
any information or read it from elswhere than from the kernel log.

> Also, out-of-tree codes tend to become defunctional. We are trying to debug
> problems caused by in-tree code. Breaking out-of-tree debugging code just
> because in-tree code developers don't want to pay the burden of maintaining
> code for debugging problems caused by in-tree code is a very bad idea.

This is a simple math of cost/benefit. The maintenance cost is not free
and paying it for odd cases most people do not care about is simply not
sustainable, we simply do not have that much of a man power.
--
Michal Hocko
SUSE Labs

2019-08-28 20:05:43

by Edward Chron

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa
<[email protected]> wrote:
>
> On 2019/08/28 16:08, Michal Hocko wrote:
> > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> >> For production systems installing and updating EBPF scripts may someday
> >> be very common, but I wonder how data center managers feel about it now?
> >> Developers are very excited about it and it is a very powerful tool but can I
> >> get permission to add or replace an existing EBPF on production systems?
> >
> > I am not sure I understand. There must be somebody trusted to take care
> > of systems, right?
> >
>
> Speak of my cases, those who take care of their systems are not developers.
> And they afraid changing code that runs in kernel mode. They unlikely give
> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> the root cause cannot be identified.

+1. Exactly. The only thing we could think of, Tetsuo, is that if Linux OOM
reporting uses an eBPF script, then systems have to load it to get any kind
of meaningful report. Frankly, if using eBPF is the route to go, then
essentially the whole OOM reporting should go there. We can adjust as we
need and have precedent for wanting to load the script. That's the best we
could come up with.

>
> Moreover, we are talking about OOM situations, where we can't expect userspace
> processes to work properly. We need to dump information we want, without
> counting on userspace processes, before sending SIGKILL.

+1. We've tried, and as you point out, for best results the kernel has to
provide the state.

Again a full system dump would be wonderful, but taking a full dump for
every OOM event on production systems? I am not nearly a good enough salesman
to sell that one. So we need an alternate mechanism.

If we can't agree on some sort of extensible, configurable approach then put
the standard OOM Report in eBPF and make it mandatory to load it so we can
justify having to do that. Linux should load it automatically.
We'll just make a few changes and additions as needed.

Sounds like a plan that we could live with.
Would be interested if this works for others as well.

2019-08-28 20:20:08

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> But with the caveat that running a eBPF script that it isn't standard Linux
> operating procedure, at this point in time any way will not be well
> received in the data center.

Can't you get your eBPF scripts into the BCC project? As far as I can tell,
BCC has been included in several distros already, and then it will become
part of the standard Linux toolkit.

>
> Our belief is if you really think eBPF is the preferred mechanism
> then move OOM reporting to an eBPF. 
> I mentioned this before but I will reiterate this here.

On the other hand, it seems many people are happy with the simple kernel OOM
report we have here. Not saying the current situation is perfect. On top of
that, some people are using kdump, and some people have resource monitoring
to warn about potential memory overcommits before OOM kicks in, etc.

2019-08-28 21:18:43

by Edward Chron

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed, Aug 28, 2019 at 1:18 PM Qian Cai <[email protected]> wrote:
>
> On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> > But with the caveat that running a eBPF script that it isn't standard Linux
> > operating procedure, at this point in time any way will not be well
> > received in the data center.
>
> Can't you get your eBPF scripts into the BCC project? As far I can tell, the BCC
> has been included in several distros already, and then it will become a part of
> standard linux toolkits.
>
> >
> > Our belief is if you really think eBPF is the preferred mechanism
> > then move OOM reporting to an eBPF.
> > I mentioned this before but I will reiterate this here.
>
> On the other hand, it seems many people are happy with the simple kernel OOM
> report we have here. Not saying the current situation is perfect. On the top of
> that, some people are using kdump, and some people have resource monitoring to
> warn about potential memory overcommits before OOM kicks in etc.

Assuming you can implement your existing report in eBPF then those who like the
current output would still get the current output. Same with the patches we sent
upstream, nothing in the report changes by default. So no problems for those who
are happy, they'll still be happy.

2019-08-28 21:36:37

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed, 2019-08-28 at 14:17 -0700, Edward Chron wrote:
> On Wed, Aug 28, 2019 at 1:18 PM Qian Cai <[email protected]> wrote:
> >
> > On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> > > But with the caveat that running a eBPF script that it isn't standard
> > > Linux
> > > operating procedure, at this point in time any way will not be well
> > > received in the data center.
> >
> > Can't you get your eBPF scripts into the BCC project? As far I can tell, the
> > BCC
> > has been included in several distros already, and then it will become a part
> > of
> > standard linux toolkits.
> >
> > >
> > > Our belief is if you really think eBPF is the preferred mechanism
> > > then move OOM reporting to an eBPF.
> > > I mentioned this before but I will reiterate this here.
> >
> > On the other hand, it seems many people are happy with the simple kernel OOM
> > report we have here. Not saying the current situation is perfect. On the top
> > of
> > that, some people are using kdump, and some people have resource monitoring
> > to
> > warn about potential memory overcommits before OOM kicks in etc.
>
> Assuming you can implement your existing report in eBPF then those who like
> the
> current output would still get the current output. Same with the patches we
> sent
> upstream, nothing in the report changes by default. So no problems for those
> who
> are happy, they'll still be happy.

I don't think it makes any sense to rewrite the existing code to depend on
eBPF though.

2019-08-29 03:33:03

by Edward Chron

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed, Aug 28, 2019 at 1:04 PM Edward Chron <[email protected]> wrote:
>
> On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa
> <[email protected]> wrote:
> >
> > On 2019/08/28 16:08, Michal Hocko wrote:
> > > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> > >> For production systems installing and updating EBPF scripts may someday
> > >> be very common, but I wonder how data center managers feel about it now?
> > >> Developers are very excited about it and it is a very powerful tool but can I
> > >> get permission to add or replace an existing EBPF on production systems?
> > >
> > > I am not sure I understand. There must be somebody trusted to take care
> > > of systems, right?
> > >
> >
> > Speak of my cases, those who take care of their systems are not developers.
> > And they afraid changing code that runs in kernel mode. They unlikely give
> > permission to install SystemTap/eBPF scripts. As a result, in many cases,
> > the root cause cannot be identified.
>
> +1. Exactly. The only thing we could think of Tetsuo is if Linux OOM Reporting
> uses a an eBPF script then systems have to load them to get any kind of
> meaningful report. Frankly, if using eBPF is the route to go than essentially
> the whole OOM reporting should go there. We can adjust as we need and
> have precedent for wanting to load the script. That's the best we could come
> up with.
>
> >
> > Moreover, we are talking about OOM situations, where we can't expect userspace
> > processes to work properly. We need to dump information we want, without
> > counting on userspace processes, before sending SIGKILL.
>
> +1. We've tried, and as you point out, for best results the kernel has to
> provide the state.
>
> Again a full system dump would be wonderful, but taking a full dump for
> every OOM event on production systems? I am not nearly a good enough salesman
> to sell that one. So we need an alternate mechanism.
>
> If we can't agree on some sort of extensible, configurable approach then put
> the standard OOM Report in eBPF and make it mandatory to load it so we can
> justify having to do that. Linux should load it automatically.
> We'll just make a few changes and additions as needed.
>
> Sounds like a plan that we could live with.
> Would be interested if this works for others as well.

One further comment. In talking with my colleagues here, who know eBPF much
better than I do, it may not be possible to implement something this
complicated with eBPF.

If that is in fact the case, then we'd have to try to hook the OOM reporting
code with tracepoints, similar to kprobes, only we want to do more than add
counters: we want to change the flow to skip small output entries that aren't
worth printing. If this isn't feasible with eBPF, then some derivative of our
approach, or enhancing the OOM output code directly, seems like the best
option. We will have to investigate this further.

2019-08-29 07:14:04

by Michal Hocko

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Wed 28-08-19 12:46:20, Edward Chron wrote:
[...]
> Our belief is if you really think eBPF is the preferred mechanism
> then move OOM reporting to an eBPF.

I've said that all this additional information has to be dynamically
extensible rather than a part of the core kernel. Whether eBPF is a
suitable tool, I do not know. I haven't explored that. There are other
ways to inject code into the kernel: systemtap/kprobes, kernel modules and
probably others.

> I mentioned this before but I will reiterate this here.
>
> So how do we get there? Let's look at the existing report which we know
> has issues.
>
> Other than a few essential OOM messages the OOM code should produce,
> such as the Killed process message sequence being included,
> you could have the entire OOM report moved to an eBPF script and
> therefore make it customizable, configurable or if you prefer programmable.

I believe we should keep the current reporting in place and allow
additional information via dynamic mechanism. Be it a registration
mechanism that modules can hook into or other more dynamic way.
The current reporting has proven to be useful in many typical oom
situations in my past years of experience. It gives the rough state of
the failing allocation, MM subsystem, tasks that are eligible and task
that is killed so that you can understand why the event happened.

I would argue that the eligible tasks should be printed on an opt-in
basis because this is more of a relic from the past when the victim
selection was less deterministic. But that is another story.

All the rest of dump_header should stay IMHO as a reasonable default and
bare minimum.

> Why? Because as we all agree, you'll never have a perfect OOM Report.
> So if you believe this, then if you will, put your money where your mouth
> is (so to speak) and make the entire OOM Report an eBPF script.
> We'd be willing to help with this.
>
> I'll give specific reasons why you want to do this.
>
> - Don't want to maintain a lot of code in the kernel (eBPF code doesn't
> count).
> - Can't produce an ideal OOM report.
> - Don't like configuring things but favor programmatic solutions.
> - Agree the existing OOM report doesn't work for all environments.
> - Want to allow flexibility but can't support everything people might
> want.
> - Then installing an eBPF for OOM Reporting isn't an option, it's
> required.

This is going into an extreme. We cannot serve all cases but that is
true for any other heuristics/reporting in the kernel. We do care about
most.

> The last reason is huge for people who live in a world with large data
> centers. Data center managers are very conservative. They don't want to
> deviate from standard operating procedure unless absolutely necessary.
> If loading an OOM Report eBPF is standard to get OOM Reporting output,
> then they'll accept that.

I have already responded to this kind of argumentation elsewhere. This
is not a relevant argument for any kernel implementation. This is a data
center management process.

--
Michal Hocko
SUSE Labs

2019-08-29 10:17:25

by Tetsuo Handa

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On 2019/08/29 16:11, Michal Hocko wrote:
> On Wed 28-08-19 12:46:20, Edward Chron wrote:
>> Our belief is if you really think eBPF is the preferred mechanism
>> then move OOM reporting to an eBPF.
>
> I've said that all this additional information has to be dynamically
> extensible rather than a part of the core kernel. Whether eBPF is the
> suitable tool, I do not know. I haven't explored that. There are other
> ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> probably others.

As for SystemTap, guru mode (an expert mode which disables the protection
provided by SystemTap, allowing the kernel to crash when something goes wrong)
could be used for holding a spinlock. However, as far as I know, holding a
mutex (or doing any operation that might sleep) from such dynamic hooks is not
allowed. Also, we will need to export various symbols in order to allow access
from such dynamic hooks.

I'm not familiar with eBPF, but I guess that eBPF is similar.

But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
SystemTap will be suitable for dumping OOM information. An OOM situation means
that even a single page fault event cannot complete, and temporary memory
allocation for reading from the kernel or writing to files cannot complete.

Therefore, we will need to hold all information in kernel memory (without
allocating any memory when the OOM event happens). Dynamic hooks could hold
a few lines of output, but not all the lines we want. The only buffer
which is preallocated and large enough would be printk()'s buffer. Thus,
I believe that we will have to use printk() in order to dump OOM information.
At that point,

static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;

bool out_of_memory(struct oom_control *oc)
{
	return oom_handler(oc);
}

and letting in-tree kernel modules override the current OOM killer would be
the only practical choice (if we refuse to add many knobs).

2019-08-29 11:57:34

by Michal Hocko

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> On 2019/08/29 16:11, Michal Hocko wrote:
> > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> >> Our belief is if you really think eBPF is the preferred mechanism
> >> then move OOM reporting to an eBPF.
> >
> > I've said that all this additional information has to be dynamically
> > extensible rather than a part of the core kernel. Whether eBPF is the
> > suitable tool, I do not know. I haven't explored that. There are other
> > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > probably others.
>
> As for SystemTap, guru mode (an expert mode which disables protection provided
> by SystemTap; allowing kernel to crash when something went wrong) could be used
> for holding spinlock. However, as far as I know, holding mutex (or doing any
> operation that might sleep) from such dynamic hooks is not allowed. Also we will
> need to export various symbols in order to allow access from such dynamic hooks.

This is the OOM path and it had better not use any sleeping locks in
the first place.

> I'm not familiar with eBPF, but I guess that eBPF is similar.
>
> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> SystemTap will be suitable for dumping OOM information. OOM situation means
> that even single page fault event cannot complete, and temporary memory
> allocation for reading from kernel or writing to files cannot complete.

And I repeat that no such reporting is going to write to files. This is
an OOM path after all.

> Therefore, we will need to hold all information in kernel memory (without
> allocating any memory when OOM event happened). Dynamic hooks could hold
> a few lines of output, but not all lines we want. The only possible buffer
> which is preallocated and large enough would be printk()'s buffer. Thus,
> I believe that we will have to use printk() in order to dump OOM information.
> At that point,

Yes, this is what I've had in mind.

>
> static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
>
> bool out_of_memory(struct oom_control *oc)
> {
> return oom_handler(oc);
> }
>
> and let in-tree kernel modules override current OOM killer would be
> the only practical choice (if we refuse adding many knobs).

Or simply provide a hook with the oom_control to be called to report
without replacing the whole oom killer behavior. That is not necessary.
--
Michal Hocko
SUSE Labs

2019-08-29 14:10:54

by Tetsuo Handa

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On 2019/08/29 20:56, Michal Hocko wrote:
>> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
>> SystemTap will be suitable for dumping OOM information. OOM situation means
>> that even single page fault event cannot complete, and temporary memory
>> allocation for reading from kernel or writing to files cannot complete.
>
> And I repeat that no such reporting is going to write to files. This is
> an OOM path afterall.

The process that fetches from e.g. an eBPF event cannot involve a page fault.
The front-end for iovisor/bcc is a Python userspace process, but I think
that such a process can't run under an OOM situation.

>
>> Therefore, we will need to hold all information in kernel memory (without
>> allocating any memory when OOM event happened). Dynamic hooks could hold
>> a few lines of output, but not all lines we want. The only possible buffer
>> which is preallocated and large enough would be printk()'s buffer. Thus,
>> I believe that we will have to use printk() in order to dump OOM information.
>> At that point,
>
> Yes, this is what I've had in mind.

Probably I took an incorrect shortcut.

Dynamic hooks could hold a few lines of output, but they cannot hold
all lines when dump_tasks() reports 32000+ processes. We have to buffer all output
in kernel memory because we can't complete even a page fault event triggered by
the Python process monitoring the eBPF event (and writing the result to some log
file or something) while out_of_memory() is in flight.

And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What I'm
saying is "we won't be able to hold output from dump_tasks() if output from
dump_tasks() goes to buffer preallocated for dynamic hooks". We have to find
a way that can handle the worst case.

2019-08-29 15:05:16

by Edward Chron

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > On 2019/08/29 16:11, Michal Hocko wrote:
> > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > >> Our belief is if you really think eBPF is the preferred mechanism
> > >> then move OOM reporting to an eBPF.
> > >
> > > I've said that all this additional information has to be dynamically
> > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > suitable tool, I do not know. I haven't explored that. There are other
> > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > probably others.
> >
> > As for SystemTap, guru mode (an expert mode which disables protection provided
> > by SystemTap; allowing kernel to crash when something went wrong) could be used
> > for holding spinlock. However, as far as I know, holding mutex (or doing any
> > operation that might sleep) from such dynamic hooks is not allowed. Also we will
> > need to export various symbols in order to allow access from such dynamic hooks.
>
> This is the oom path and it should better not use any sleeping locks in
> the first place.
>
> > I'm not familiar with eBPF, but I guess that eBPF is similar.
> >
> > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> > SystemTap will be suitable for dumping OOM information. OOM situation means
> > that even single page fault event cannot complete, and temporary memory
> > allocation for reading from kernel or writing to files cannot complete.
>
> And I repeat that no such reporting is going to write to files. This is
> an OOM path afterall.
>
> > Therefore, we will need to hold all information in kernel memory (without
> > allocating any memory when OOM event happened). Dynamic hooks could hold
> > a few lines of output, but not all lines we want. The only possible buffer
> > which is preallocated and large enough would be printk()'s buffer. Thus,
> > I believe that we will have to use printk() in order to dump OOM information.
> > At that point,
>
> Yes, this is what I've had in mind.
>

+1: It makes sense to keep the report going to dmesg so it persists.
That is where it has always gone and there is no reason to change.
You can have several OOMs back to back and you'd like to retain the output.
All the information should be kept together in the OOM report.

> >
> > static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> >
> > bool out_of_memory(struct oom_control *oc)
> > {
> > return oom_handler(oc);
> > }
> >
> > and let in-tree kernel modules override current OOM killer would be
> > the only practical choice (if we refuse adding many knobs).
>
> Or simply provide a hook with the oom_control to be called to report
> without replacing the whole oom killer behavior. That is not necessary.

For a very simple addition, such as adding a line of output, this works.
It would still be nice to address the fact that the existing OOM report prints
either all of the user processes or none. It would be nice to add some control
for that. That's what we did.

> --
> Michal Hocko
> SUSE Labs

2019-08-29 15:21:45

by Edward Chron

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, Aug 29, 2019 at 12:11 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 28-08-19 12:46:20, Edward Chron wrote:
> [...]
> > Our belief is if you really think eBPF is the preferred mechanism
> > then move OOM reporting to an eBPF.
>
> I've said that all this additional information has to be dynamically
> extensible rather than a part of the core kernel. Whether eBPF is the
> suitable tool, I do not know. I haven't explored that. There are other
> ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> probably others.

For simple code injections eBPF or a kprobe works, and a tracepoint would
help with that. For example, we could add our one line of task information
that we find very useful this way.

For adding controls to limit output for processes, slabs and vmalloc entries
it would be harder to inject code. Our solution was to use debugfs.
An alternative could be to add a simple sysctl if using debugfs is not
appropriate. As our code illustrated, this can be added without changing the
existing report in any substantive way. I think there is value in this and it
is core to what the OOM report should provide. Additional items may be add-ons
that are environment specific, but these are OOM reporting essentials IMHO.

>
> > I mentioned this before but I will reiterate this here.
> >
> > So how do we get there? Let's look at the existing report which we know
> > has issues.
> >
> > Other than a few essential OOM messages the OOM code should produce,
> > such as the Killed process message sequence being included,
> > you could have the entire OOM report moved to an eBPF script and
> > therefore make it customizable, configurable or if you prefer programmable.
>
> I believe we should keep the current reporting in place and allow
> additional information via dynamic mechanism. Be it a registration
> mechanism that modules can hook into or other more dynamic way.
> The current reporting has proven to be useful in many typical oom
> situations in my past years of experience. It gives the rough state of
> the failing allocation, MM subsystem, tasks that are eligible and task
> that is killed so that you can understand why the event happened.
>
> I would argue that the eligible tasks should be printed on an opt-in
> basis because this is more of a relic from the past when the victim
> selection was less deterministic. But that is another story.
>
> All the rest of dump_header should stay IMHO as a reasonable default and
> bare minimum.
>
> > Why? Because as we all agree, you'll never have a perfect OOM Report.
> > So if you believe this, then if you will, put your money where your mouth
> > is (so to speak) and make the entire OOM Report an eBPF script.
> > We'd be willing to help with this.
> >
> > I'll give specific reasons why you want to do this.
> >
> > - Don't want to maintain a lot of code in the kernel (eBPF code doesn't
> > count).
> > - Can't produce an ideal OOM report.
> > - Don't like configuring things but favor programmatic solutions.
> > - Agree the existing OOM report doesn't work for all environments.
> > - Want to allow flexibility but can't support everything people might
> > want.
> > - Then installing an eBPF for OOM Reporting isn't an option, it's
> > required.
>
> This is going into an extreme. We cannot serve all cases but that is
> true for any other heuristics/reporting in the kernel. We do care about
> most.

Unfortunately my argument for this is moot; this can't be done with
eBPF, at least not now.

>
> > The last reason is huge for people who live in a world with large data
> > centers. Data center managers are very conservative. They don't want to
> > deviate from standard operating procedure unless absolutely necessary.
> > If loading an OOM Report eBPF is standard to get OOM Reporting output,
> > then they'll accept that.
>
> I have already responded to this kind of argumentation elsewhere. This
> is not a relevant argument for any kernel implementation. This is a data
> process management process.
>
> --
> Michal Hocko
> SUSE Labs

2019-08-29 15:43:53

by Qian Cai

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > > then move OOM reporting to an eBPF.
> > > >
> > > > I've said that all this additional information has to be dynamically
> > > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > > suitable tool, I do not know. I haven't explored that. There are other
> > > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > > probably others.
> > >
> > > As for SystemTap, guru mode (an expert mode which disables protection
> > > provided
> > > by SystemTap; allowing kernel to crash when something went wrong) could be
> > > used
> > > for holding spinlock. However, as far as I know, holding mutex (or doing
> > > any
> > > operation that might sleep) from such dynamic hooks is not allowed. Also
> > > we will
> > > need to export various symbols in order to allow access from such dynamic
> > > hooks.
> >
> > This is the oom path and it should better not use any sleeping locks in
> > the first place.
> >
> > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > >
> > > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> > > SystemTap will be suitable for dumping OOM information. OOM situation
> > > means
> > > that even single page fault event cannot complete, and temporary memory
> > > allocation for reading from kernel or writing to files cannot complete.
> >
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path afterall.
> >
> > > Therefore, we will need to hold all information in kernel memory (without
> > > allocating any memory when OOM event happened). Dynamic hooks could hold
> > > a few lines of output, but not all lines we want. The only possible buffer
> > > which is preallocated and large enough would be printk()'s buffer. Thus,
> > > I believe that we will have to use printk() in order to dump OOM
> > > information.
> > > At that point,
> >
> > Yes, this is what I've had in mind.
> >
>
> +1: It makes sense to keep the report going to the dmesg to persist.
> That is where it has always gone and there is no reason to change.
> You can have several OOMs back to back and you'd like to retain the output.
> All the information should be kept together in the OOM report.
>
> > >
> > >   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> > >
> > >   bool out_of_memory(struct oom_control *oc)
> > >   {
> > >           return oom_handler(oc);
> > >   }
> > >
> > > and let in-tree kernel modules override current OOM killer would be
> > > the only practical choice (if we refuse adding many knobs).
> >
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
>
> For very simple addition, to add a line of output this works.
> It would still be nice to address the fact the existing OOM Report prints
> all of the user processes or none. It would be nice to add some control
> for that. That's what we did.

It feels like you are going in circles to "sell" this without any new
information. If you need to deal with OOM that often, it might also be worth
working with FB on oomd.

https://github.com/facebookincubator/oomd

It is well known that kernel OOM handling can be slow and painful to deal with,
so I don't buy the argument that kernel OOM recovery is better/faster than a
kdump reboot.

It is not unusual that when the system is triggering a kernel OOM, it is almost
trashed/dead. Although developers are working hard to improve recovery after
OOM, there are still many error paths that are not going to survive, which would
leak memory, introduce undefined behavior, corrupt memory, etc.

2019-08-29 15:50:53

by Edward Chron

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, Aug 29, 2019 at 7:09 AM Tetsuo Handa
<[email protected]> wrote:
>
> On 2019/08/29 20:56, Michal Hocko wrote:
> >> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> >> SystemTap will be suitable for dumping OOM information. OOM situation means
> >> that even single page fault event cannot complete, and temporary memory
> >> allocation for reading from kernel or writing to files cannot complete.
> >
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path afterall.
>
> The process who fetches from e.g. eBPF event cannot involve page fault.
> The front-end for iovisor/bcc is a python userspace process. But I think
> that such process can't run under OOM situation.
>
> >
> >> Therefore, we will need to hold all information in kernel memory (without
> >> allocating any memory when OOM event happened). Dynamic hooks could hold
> >> a few lines of output, but not all lines we want. The only possible buffer
> >> which is preallocated and large enough would be printk()'s buffer. Thus,
> >> I believe that we will have to use printk() in order to dump OOM information.
> >> At that point,
> >
> > Yes, this is what I've had in mind.
>
> Probably I incorrectly shortcut.
>
> Dynamic hooks could hold a few lines of output, but dynamic hooks can not hold
> all lines when dump_tasks() reports 32000+ processes. We have to buffer all output
> in kernel memory because we can't complete even a page fault event triggered by
> the python process monitoring eBPF event (and writing the result to some log file
> or something) while out_of_memory() is in flight.
>
> And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What I'm
> saying is "we won't be able to hold output from dump_tasks() if output from
> dump_tasks() goes to buffer preallocated for dynamic hooks". We have to find
> a way that can handle the worst case.

With the patch series we sent, the addition of the vmalloc entries print
required us to add a small piece of code to vmalloc.c, but we thought this
should be a core OOM reporting function. However, you want to limit which
vmalloc entries you print, probably to only very large memory users. For us
this generates just a few entries and has proven useful.

The changes to limit how many processes get printed, so you don't have the
all-or-nothing behavior, would be nice to have. It would be easiest if there
was a standard mechanism to specify which entries to print, probably by a
minimum size, which is what we did. We used debugfs to set the controls, but
sysctl or some other mechanism could be used.

The rest of what we did might be implemented with hooks as they only output
a line or two and I've already got rid of information we had that was
redundant.

2019-08-29 16:11:16

by Edward Chron

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, Aug 29, 2019 at 8:42 AM Qian Cai <[email protected]> wrote:
>
> On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > > > then move OOM reporting to an eBPF.
> > > > >
> > > > > I've said that all this additional information has to be dynamically
> > > > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > > > suitable tool, I do not know. I haven't explored that. There are other
> > > > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > > > probably others.
> > > >
> > > > As for SystemTap, guru mode (an expert mode which disables protection
> > > > provided
> > > > by SystemTap; allowing kernel to crash when something went wrong) could be
> > > > used
> > > > for holding spinlock. However, as far as I know, holding mutex (or doing
> > > > any
> > > > operation that might sleep) from such dynamic hooks is not allowed. Also
> > > > we will
> > > > need to export various symbols in order to allow access from such dynamic
> > > > hooks.
> > >
> > > This is the oom path and it should better not use any sleeping locks in
> > > the first place.
> > >
> > > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > > >
> > > > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> > > > SystemTap will be suitable for dumping OOM information. OOM situation
> > > > means
> > > > that even single page fault event cannot complete, and temporary memory
> > > > allocation for reading from kernel or writing to files cannot complete.
> > >
> > > And I repeat that no such reporting is going to write to files. This is
> > > an OOM path afterall.
> > >
> > > > Therefore, we will need to hold all information in kernel memory (without
> > > > allocating any memory when OOM event happened). Dynamic hooks could hold
> > > > a few lines of output, but not all lines we want. The only possible buffer
> > > > which is preallocated and large enough would be printk()'s buffer. Thus,
> > > > I believe that we will have to use printk() in order to dump OOM
> > > > information.
> > > > At that point,
> > >
> > > Yes, this is what I've had in mind.
> > >
> >
> > +1: It makes sense to keep the report going to the dmesg to persist.
> > That is where it has always gone and there is no reason to change.
> > You can have several OOMs back to back and you'd like to retain the output.
> > All the information should be kept together in the OOM report.
> >
> > > >
> > > > static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> > > >
> > > > bool out_of_memory(struct oom_control *oc)
> > > > {
> > > > return oom_handler(oc);
> > > > }
> > > >
> > > > and let in-tree kernel modules override current OOM killer would be
> > > > the only practical choice (if we refuse adding many knobs).
> > >
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not necessary.
> >
> > For very simple addition, to add a line of output this works.
> > It would still be nice to address the fact the existing OOM Report prints
> > all of the user processes or none. It would be nice to add some control
> > for that. That's what we did.
>
> Feel like you are going in circles to "sell" without any new information. If you
> need to deal with OOM that often, it might also worth working with FB on oomd.
>
> https://github.com/facebookincubator/oomd
>
> It is well-known that kernel OOM could be slow and painful to deal with, so I
> don't buy-in the argument that kernel OOM recover is better/faster than a kdump
> reboot.
>
> It is not unusual that when the system is triggering a kernel OOM, it is almost
> trashed/dead. Although developers are working hard to improve the recovery after
> OOM, there are still many error-paths that are not going to survive which would
> leak memories, introduce undefined behaviors, corrupt memory etc.

But as you have pointed out, many people are happy with current OOM processing,
which is the report and recovery, so for those people a kdump reboot is
overkill. Making the OOM report at least optionally a bit more informative has
value. Also, making sure it doesn't produce excessive output is desirable.

I do agree that for developers it helps to have all the system state a kdump
provides, and as long as you can reproduce the OOM event that works well. But
that is not the common case, as has already been discussed.

Also, OOM events that are due to kernel bugs could leak memory over time
and cause a crash, true. But that is not what we typically see. In fact, we've
had customers come back and report issues on systems that have been in
continuous operation for years. There is no point in crashing their systems.
Linux, if properly maintained, is thankfully quite stable. But OOMs do happen,
and root causing them to prevent future occurrences is desired.

2019-08-29 16:19:16

by Michal Hocko

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu 29-08-19 08:03:19, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <[email protected]> wrote:
[...]
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
>
> For very simple addition, to add a line of output this works.

Why would a hook be limited to small stuff?

> It would still be nice to address the fact the existing OOM Report prints
> all of the user processes or none. It would be nice to add some control
> for that. That's what we did.

TBH, I am not really convinced a partial task list is desirable or easy
to configure. What is the criterion? oom_score (a potentially unstable
metric)? RSS? Something else?
--
Michal Hocko
SUSE Labs

2019-08-29 16:36:43

by Edward Chron

Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, Aug 29, 2019 at 9:18 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 29-08-19 08:03:19, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko <[email protected]> wrote:
> [...]
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not necessary.
> >
> > For a very simple addition, such as adding a line of output, this works.
>
> Why would a hook be limited to small stuff?

It could be larger, but the few items we added were just a line or two
of output.

The vmalloc, slab and process lists can print many entries, so we added
a control for those.

>
> > It would still be nice to address the fact that the existing OOM report
> > prints all of the user processes or none. It would be nice to add some
> > control for that. That's what we did.
>
> TBH, I am not really convinced a partial task list is desirable or easy
> to configure. What is the criterion? oom_score (with a potentially unstable
> metric)? Rss? Something else?

We used an estimate of the memory footprint of the process:
rss, swap pages and page table pages.

> --
> Michal Hocko
> SUSE Labs

2019-08-29 18:45:37

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:

> > Feel like you are going in circles to "sell" without any new information.
> > If you need to deal with OOM that often, it might also be worth working
> > with FB on oomd.
> >
> > https://github.com/facebookincubator/oomd
> >
> > It is well known that kernel OOM can be slow and painful to deal with, so
> > I don't buy the argument that kernel OOM recovery is better/faster than a
> > kdump reboot.
> >
> > It is not unusual that when the system is triggering a kernel OOM, it is
> > almost trashed/dead. Although developers are working hard to improve the
> > recovery after OOM, there are still many error paths that are not going to
> > survive, which would leak memory, introduce undefined behavior, corrupt
> > memory, etc.
>
> But as you have pointed out, many people are happy with current OOM
> processing, which is the report and recovery, so for those people a kdump
> reboot is overkill. Making the OOM report at least optionally a bit more
> informative has value. Also making sure it doesn't produce excessive output
> is desirable.
>
> I do agree that, for developers, having all of the system state a kdump
> provides is valuable, and as long as you can reproduce the OOM event that
> works well. But that is not the common case, as has already been discussed.
>
> Also, OOM events that are due to kernel bugs could leak memory over time
> and cause a crash, true. But that is not what we typically see. In fact
> we've had customers come back and report issues on systems that have been
> in continuous operation for years. No point in crashing their system.
> Linux, if properly maintained, is thankfully quite stable. But OOMs do
> happen and root causing them to prevent future occurrences is desired.

This is not what I meant. After an OOM event happens, many kernel memory
allocations could fail. Since very few people are testing those error paths
due to allocation failures, it is considered one of the most buggy areas in
the kernel. Developers have mostly been focused on making sure the kernel
OOM does not happen in the first place.

I still think the time is better spent on improving things like eBPF, oomd
and kdump etc. to solve your problem, but leave the kernel OOM report code
alone.

2019-08-29 22:45:06

by Edward Chron

[permalink] [raw]
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional information

On Thu, Aug 29, 2019 at 11:44 AM Qian Cai <[email protected]> wrote:
>
> On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:
>
> > > Feel like you are going in circles to "sell" without any new
> > > information. If you need to deal with OOM that often, it might also be
> > > worth working with FB on oomd.
> > >
> > > https://github.com/facebookincubator/oomd
> > >
> > > It is well known that kernel OOM can be slow and painful to deal with,
> > > so I don't buy the argument that kernel OOM recovery is better/faster
> > > than a kdump reboot.
> > >
> > > It is not unusual that when the system is triggering a kernel OOM, it is
> > > almost trashed/dead. Although developers are working hard to improve the
> > > recovery after OOM, there are still many error paths that are not going
> > > to survive, which would leak memory, introduce undefined behavior,
> > > corrupt memory, etc.
> >
> > But as you have pointed out, many people are happy with current OOM
> > processing, which is the report and recovery, so for those people a kdump
> > reboot is overkill. Making the OOM report at least optionally a bit more
> > informative has value. Also making sure it doesn't produce excessive
> > output is desirable.
> >
> > I do agree that, for developers, having all of the system state a kdump
> > provides is valuable, and as long as you can reproduce the OOM event that
> > works well. But that is not the common case, as has already been
> > discussed.
> >
> > Also, OOM events that are due to kernel bugs could leak memory over time
> > and cause a crash, true. But that is not what we typically see. In fact
> > we've had customers come back and report issues on systems that have been
> > in continuous operation for years. No point in crashing their system.
> > Linux, if properly maintained, is thankfully quite stable. But OOMs do
> > happen and root causing them to prevent future occurrences is desired.
>
> This is not what I meant. After an OOM event happens, many kernel memory
> allocations could fail. Since very few people are testing those error paths
> due to allocation failures, it is considered one of the most buggy areas in
> the kernel. Developers have mostly been focused on making sure the kernel
> OOM does not happen in the first place.
>
> I still think the time is better spent on improving things like eBPF, oomd
> and kdump etc. to solve your problem, but leave the kernel OOM report code
> alone.
>

Sure would rather spend my time doing other things.
No argument about that. No one likes OOMs.
If I never see another OOM I'd be quite happy.

But OOM events still happen and an OOM report gets generated.
When it happens it is useful to get information that can help
find the cause of the OOM so it can be fixed and won't happen again.
We get tasked to root cause OOMs even though we'd rather do
other things.

We've added a bit of output to the OOM report and it has been helpful.
We also reduce our total output by printing only the larger entries,
along with helpful summaries. We've been using and supporting this code
for quite a few releases. We haven't had problems, and we have a lot of
systems in use.

Contributing to an open source project like Linux is good.
If the code is not accepted it's not the end of the world.
I was told to offer our code upstream and to try to be helpful.

I understand that processing an OOM event can be flaky.
We add a few lines of OOM output but in fact we reduce our total
output because we skip printing smaller entries and print
summaries instead.

So if the volume of the output increases the likelihood of system
failure during an OOM event, then we've actually increased our
reliability. Maybe that is why we haven't had any problems.

As far as switching from generating an OOM report to taking a dump and
restarting the system, the choice is not mine to decide. Way above my pay
grade. When asked, I am happy to look at a dump, but dumps plus restarts
for the systems we work on take too long, so I typically don't get a dump
to look at. I have to make do with the OOM output and logs.

Also, and depending on what you work on, you may take satisfaction that
OOM events are far less traumatic with newer versions of Linux, at least
on our systems. The folks upstream do really good work; give credit where
credit is due. Maybe tools like KASAN really help, which we also use.

Sure, people fix bugs all the time; Linux is huge and super complicated.
But many of the bugs are not very common, and we spend an amazing (to me
anyway) amount of time testing, so when we take OOM events, even multiple
OOM events back to back, the system almost always recovers and we don't
seem to bleed memory. That is why we have systems up for months and even
years.

Occasionally we see a watchdog timeout failure, and that can be due to a
low-memory situation, but just FYI a fair number of those do not involve
OOM events, so it's not because of issues with OOM code, reporting or
otherwise.

Regardless, thank-you for your time and for your comments.
Constructive feedback is useful and certainly appreciated.

By the way we use oomd on some systems. It is helpful and
in my experience it helps to reduce OOM events but sadly
they still occur. For systems where it is not used, again that
is not my choice to make.

Edward Chron
Arista Networks